text analysis of social networks: working with fb and vk data
TRANSCRIPT
Data SNA User Gender User Language User Interests User Matching Other tasks References
Text Analysis of Social NetworksWorking with FB and VK DataAI Ukraine Conference Kharkiv Ukraine
Alexander Panchenkoalexanderpanchenkouclouvainbe
Digital Society Laboratory LLC amp UCLouvain
October 25 2014
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the usersrsquos standpoint
Facebook (FB) and VKontakte (VK)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the data minerrsquos standpoint
Facebook (FB) and VKontakte (VK)
Profiles a set of user attributescategorical variables (region city profession etc)integer variables (age graduation year etc)text variables (name surname etc)
Network a graph that relates usersfriendship graphfollowers graphcommenting graph etc
Textspostscommentsgroup titles and descriptions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gathering of VK and FB data
Big Data VK worth tens or even hundreds of TBDecide what do you need (posts profiles etc)Download
APIScraping
Download limits and API limitations are specific for eachnetworkParallelization is very practical especially horizontal one
Amazon EC2 Distributed Message Queues
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Storing VK and FB data
Again Big DataNoSQL solutions are helpfulRaw data Amazon S3For analysis HDFSEfficient retrieval Elastic Search
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social Network Analysis
Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches
What scientific communities analyze social networks
60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the usersrsquos standpoint
Facebook (FB) and VKontakte (VK)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the data minerrsquos standpoint
Facebook (FB) and VKontakte (VK)
Profiles a set of user attributescategorical variables (region city profession etc)integer variables (age graduation year etc)text variables (name surname etc)
Network a graph that relates usersfriendship graphfollowers graphcommenting graph etc
Textspostscommentsgroup titles and descriptions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gathering of VK and FB data
Big Data VK worth tens or even hundreds of TBDecide what do you need (posts profiles etc)Download
APIScraping
Download limits and API limitations are specific for eachnetworkParallelization is very practical especially horizontal one
Amazon EC2 Distributed Message Queues
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Storing VK and FB data
Again Big DataNoSQL solutions are helpfulRaw data Amazon S3For analysis HDFSEfficient retrieval Elastic Search
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social Network Analysis
Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches
What scientific communities analyze social networks
60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the usersrsquos standpoint
Facebook (FB) and VKontakte (VK)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the data minerrsquos standpoint
Facebook (FB) and VKontakte (VK)
Profiles a set of user attributescategorical variables (region city profession etc)integer variables (age graduation year etc)text variables (name surname etc)
Network a graph that relates usersfriendship graphfollowers graphcommenting graph etc
Textspostscommentsgroup titles and descriptions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gathering of VK and FB data
Big Data VK worth tens or even hundreds of TBDecide what do you need (posts profiles etc)Download
APIScraping
Download limits and API limitations are specific for eachnetworkParallelization is very practical especially horizontal one
Amazon EC2 Distributed Message Queues
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Storing VK and FB data
Again Big DataNoSQL solutions are helpfulRaw data Amazon S3For analysis HDFSEfficient retrieval Elastic Search
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social Network Analysis
Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches
What scientific communities analyze social networks
60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the usersrsquos standpoint
Facebook (FB) and VKontakte (VK)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the data minerrsquos standpoint
Facebook (FB) and VKontakte (VK)
Profiles a set of user attributescategorical variables (region city profession etc)integer variables (age graduation year etc)text variables (name surname etc)
Network a graph that relates usersfriendship graphfollowers graphcommenting graph etc
Textspostscommentsgroup titles and descriptions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gathering of VK and FB data
Big Data VK worth tens or even hundreds of TBDecide what do you need (posts profiles etc)Download
APIScraping
Download limits and API limitations are specific for eachnetworkParallelization is very practical especially horizontal one
Amazon EC2 Distributed Message Queues
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Storing VK and FB data
Again Big DataNoSQL solutions are helpfulRaw data Amazon S3For analysis HDFSEfficient retrieval Elastic Search
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social Network Analysis
Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches
What scientific communities analyze social networks
60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the data minerrsquos standpoint
Facebook (FB) and VKontakte (VK)
Profiles a set of user attributescategorical variables (region city profession etc)integer variables (age graduation year etc)text variables (name surname etc)
Network a graph that relates usersfriendship graphfollowers graphcommenting graph etc
Textspostscommentsgroup titles and descriptions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gathering of VK and FB data
Big Data VK worth tens or even hundreds of TBDecide what do you need (posts profiles etc)Download
APIScraping
Download limits and API limitations are specific for eachnetworkParallelization is very practical especially horizontal one
Amazon EC2 Distributed Message Queues
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Storing VK and FB data
Again Big DataNoSQL solutions are helpfulRaw data Amazon S3For analysis HDFSEfficient retrieval Elastic Search
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social Network Analysis
Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches
What scientific communities analyze social networks
60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gathering of VK and FB data
Big Data VK worth tens or even hundreds of TBDecide what do you need (posts profiles etc)Download
APIScraping
Download limits and API limitations are specific for eachnetworkParallelization is very practical especially horizontal one
Amazon EC2 Distributed Message Queues
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Storing VK and FB data
Again Big DataNoSQL solutions are helpfulRaw data Amazon S3For analysis HDFSEfficient retrieval Elastic Search
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social Network Analysis
Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches
What scientific communities analyze social networks
60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Storing VK and FB data
Again Big DataNoSQL solutions are helpfulRaw data Amazon S3For analysis HDFSEfficient retrieval Elastic Search
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social Network Analysis
Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches
What scientific communities analyze social networks
60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social Network Analysis
Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches
What scientific communities analyze social networks
60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Social Network Analysis
Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches
What scientific communities analyze social networks
60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin
Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc
By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]
By full name Burger et al [2011] Panchenko and Teterin[2014]
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
httpresearchdigsolabcomgender
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure Name-surname co-occurrences rows and columns are sorted byfrequency
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin
Table Most discriminative and frequent two character endings ofRussian names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input a string representing a name of a personoutput gender (male or female)binary classification task
Features
endingscharacter n-gramsdictionary of malefemale names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)
Dictionaries of first and last names
probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002
Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc
Research Questions
Which method is the best for Russian languageHow to adopt it to the FB profile
Contributions
Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)
Common Russian character trigrams
на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection
chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector
guess-languagehttpscodegooglecompguess-language
Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters
Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)
Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Method
Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers
P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset Results
9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov
input some SN data representing a useroutput list of user interests
Motivation
Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data FB and VK groups
Text corpus
41 million of VK groups11 million of FB publics15 million of FB groups
Data format
Title andor descriptionList of membersNumber of comments likes posts by a member
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Data VK groups
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier
Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests
4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of grouprsquos interests with its users
Engagement of a person into an interest category isproportional to the activity of the person in groups of this category
e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r
l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts
Association score of a user and an interest depends onengagement in a group and on the number of groups
all asymp α middot efb middot gfb + β middot evk middot gvk
evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category the best and the worst
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov
Motivation
input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc
Contribution
user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88
training set 92488 matched FB-VK profiles
1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names
2 Candidate ranking The candidates are ranked according tosimilarity of their friends
3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms
ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if
First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β
sims(si sj) = 1minus lev(si sj)max(|si | |sj |)
Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency
simp(pvk pfb) =sum
j sims(sfi s
fj )gtαandsims(ss
i ssj )gtβ
min(1N
|s fj | middot |ss
j |)
Here s fi and ss
i are first and second names of a VK profilecorrespondingly while s f
j and ssj refer to a FB profile
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to aninput profile pvk
The best candidate pfb should pass two thresholds to match
its score should be higher than the score threshold γ
simp(pvk pfb) gt γ
either the only candidate or score ratio between it and the nextbest candidate pprime
fb should be higher than the ratio threshold δ
simp(pvk pfb)
simp(pvk pprimefb)
gt δ
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Results matching VK and FB profiles
First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FBVK data
User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision
User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you Questions
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM
Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics
Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21
Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media
Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412
Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop
Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics
Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM
Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-
Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM
Alexander Panchenko Text Analysis of Social Networks
- Social Network Data
- Social Network Analysis
- User Gender Detection
- User Language Detection
- User Interests Detection
- VK-FB User Matching
- Other SNA Tasks
-