Text Analysis of Social Networks: Working with FB and VK Data

AI Ukraine Conference, Kharkiv, Ukraine
Alexander Panchenko (alexander.panchenko@uclouvain.be)
Digital Society Laboratory LLC & UCLouvain
October 25, 2014


Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

1 Social Network Data

Social networks from the user's standpoint

Facebook (FB) and VKontakte (VK)


Social networks from the data miner's standpoint

Facebook (FB) and VKontakte (VK)

• Profiles: a set of user attributes
  • categorical variables (region, city, profession, etc.)
  • integer variables (age, graduation year, etc.)
  • text variables (name, surname, etc.)
• Network: a graph that relates users
  • friendship graph
  • followers graph
  • commenting graph, etc.
• Texts
  • posts
  • comments
  • group titles and descriptions


Gathering of VK and FB data

• Big Data: VK is worth tens or even hundreds of TB
• Decide what you need (posts, profiles, etc.)
• Download
  • API
  • Scraping
• Download limits and API limitations are specific to each network
• Parallelization is very practical, especially horizontal scaling
  • Amazon EC2, distributed message queues
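The point about parallel downloading can be illustrated with a small worker pool. This is only a sketch under stated assumptions: `fetch_profile` is a hypothetical stand-in for a real API request or scraper, and a production setup would spread the ID queue over several machines (for example via a message queue) rather than over threads in one process.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_profile(user_id):
    """Hypothetical placeholder for one API request or page scrape."""
    return {"id": user_id, "name": f"user{user_id}"}


def download_profiles(user_ids, workers=4):
    """Fetch profiles concurrently; results keep the order of user_ids."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_profile, user_ids))


profiles = download_profiles(range(5))
```

Rate limits differ per network, so a real `fetch_profile` would also need throttling and retries.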


Storing VK and FB data

• Again, Big Data
• NoSQL solutions are helpful
• Raw data: Amazon S3
• For analysis: HDFS
• Efficient retrieval: Elasticsearch

2 Social Network Analysis

Social Network Analysis

• Structure analysis: friendship graph, comments graph, etc.
• Content analysis: profile attributes, posts, comments, etc.
• Combined approaches

Which scientific communities analyze social networks?

• 1960s: the first structural methods
• 2000s: the online social network analysis boom
• Social Network Analysis community (sociologists, statisticians, physicists)
• Data and Graph Mining community
• Natural Language Processing community


Technologies for analysis of social networks

• Machine Learning: hidden vs. observable user attributes
• Training of the model can often be scaled vertically
• Applying the model should be scaled horizontally

3 User Gender Detection

Problem

Joint work with Andrey Teterin

• Detect gender of a user
  • to profile a user
  • user segmentation is helpful in search, advertisement, etc.
• By text written by a user: Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009], Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al. [2010], Rangel and Rosso [2013], Al Zamal et al. [2012], and Liu et al. [2012]
• By full name: Burger et al. [2011], Panchenko and Teterin [2014]


Online demo

http://research.digsolab.com/gender


Training Data

• 100,000 full names of Facebook users with known gender
• full name: first and last name of a user
• gender: male or female
• names written in both Cyrillic and Latin alphabets
• “Alexander Ivanov”, “Masha Sidorova”, “Pavel Nikolenko”, etc.


Training Data

Figure: Name-surname co-occurrences; rows and columns are sorted by frequency.


Character endings of Russian names

• 72% of first names have a typical male/female ending
• 68% of surnames have a typical male/female ending
• a typical male/female ending splits males from females with an error of less than 5%
• gender of ≥ 50% of first names is recognized with 8 endings
• gender of ≥ 50% of second names is recognized with 5 endings

Conclusion

A simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach.


Character endings of Russian names

Type          Ending     Gender   Error, %   Example
first name    na (на)    female   0.27       Ekaterina
first name    iya (ия)   female   0.32       Anastasiya
first name    ei (ей)    male     0.16       Sergei
first name    dr (др)    male     0.00       Alexandr
first name    ga (га)    male     4.94       Serega
first name    an (ан)    male     4.99       Ivan
first name    la (ла)    female   4.23       Luidmila
first name    ii (ий)    male     0.34       Yurii
second name   va (ва)    female   0.28       Morozova
second name   ov (ов)    male     0.21       Objedkov
second name   na (на)    female   2.22       Matyushina
second name   ev (ев)    male     0.44       Sergeev
second name   in (ин)    male     1.94       Teterin

Table: Most discriminative and frequent two-character endings of Russian names.
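The endings in the table can be turned into a simple lookup classifier. The sketch below is an illustration only, not necessarily the rule-based baseline used in the experiments; it works on the transliterated endings, and returning `None` when no rule fires or when the two rules disagree is an assumption made here.

```python
# Transliterated endings taken from the table above.
FIRST_NAME_ENDINGS = {"na": "female", "iya": "female", "ei": "male", "dr": "male",
                      "ga": "male", "an": "male", "la": "female", "ii": "male"}
SURNAME_ENDINGS = {"va": "female", "ov": "male", "na": "female",
                   "ev": "male", "in": "male"}


def gender_by_ending(word, endings):
    """Return the gender of the first matching ending, or None."""
    word = word.lower()
    # Longer endings are tried first, so "iya" is checked before two-letter ones.
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending):
            return endings[ending]
    return None


def rule_based_gender(first_name, surname):
    """Return 'male', 'female', or None if no rule fires or the rules disagree."""
    votes = {gender_by_ending(first_name, FIRST_NAME_ENDINGS),
             gender_by_ending(surname, SURNAME_ENDINGS)} - {None}
    return votes.pop() if len(votes) == 1 else None
```

A name such as "Pavel Nikolenko" matches none of the endings, which illustrates why a purely symbolic method leaves part of the names unclassified.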


Gender Detection Method

• input: a string representing a name of a person
• output: gender (male or female)
• binary classification task

Features

• endings
• character n-grams
• dictionary of male/female names and surnames

Model

• L2-regularized Logistic Regression
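To make the feature and model choice concrete, here is a minimal pure-Python sketch of L2-regularized logistic regression over binary character 3-gram features, trained with stochastic gradient descent. Everything specific in it is an assumption for illustration: the boundary markers `^` and `$`, the hyperparameters, and the eight-name toy training set, which stands in for the real corpus.

```python
import math
from collections import defaultdict


def char_trigrams(full_name):
    """Set of character 3-grams, with ^ and $ marking word boundaries."""
    feats = set()
    for word in full_name.lower().split():
        padded = "^" + word + "$"
        feats.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return feats


def train(samples, epochs=300, lr=0.1, l2=0.01):
    """SGD on the L2-regularized log-loss. Labels: 1 = female, 0 = male."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for name, label in samples:
            feats = char_trigrams(name)
            z = bias + sum(weights[f] for f in feats)
            err = 1.0 / (1.0 + math.exp(-z)) - label  # predicted P(female) - label
            bias -= lr * err
            for f in feats:
                # The L2 penalty is applied only to the features active in this sample.
                weights[f] -= lr * (err + l2 * weights[f])
    return weights, bias


def predict(model, name):
    weights, bias = model
    z = bias + sum(weights.get(f, 0.0) for f in char_trigrams(name))
    return "female" if z > 0 else "male"


# Toy training set, for illustration only.
TOY = [("Alexander Ivanov", 0), ("Alexandra Ivanova", 1),
       ("Oleg Arbuzov", 0), ("Natalia Arbuzova", 1),
       ("Sergei Morozov", 0), ("Ekaterina Morozova", 1),
       ("Pavel Sidorov", 0), ("Masha Sidorova", 1)]
model = train(TOY)
```

On this toy data the model mostly learns the surname endings ("ov$" versus "ova", "va$"), so `predict(model, "Ivan Petrov")` and `predict(model, "Anna Petrova")` come out as male and female respectively.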


Features

Character n-grams

• males: Alexander Yaroskavski, Oleg Arbuzov
• females: Alexandra Yaroskavskaya, Nataliya Arbuzova
• BUT: “Sidorenko”, “Moroz”, or “Bondar”
• two most common one-character endings: “a” and “ya” (“я”)

Dictionaries of first and last names

• probability that a name belongs to the male gender: P(c = male | firstname), P(c = male | lastname)
• 3,427 first names, 11,411 last names
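The dictionary features can be estimated as simple relative frequencies. The sketch below assumes that P(c = male | name) is the share of male users among all users bearing that name; the labeled names are made up for illustration.

```python
from collections import defaultdict


def male_probabilities(labeled_names):
    """Estimate P(c = male | name) as the share of male users with that name."""
    counts = defaultdict(lambda: [0, 0])  # name -> [male count, total count]
    for name, gender in labeled_names:
        entry = counts[name.lower()]
        entry[0] += gender == "male"
        entry[1] += 1
    return {name: male / total for name, (male, total) in counts.items()}


# Hypothetical labeled first names, for illustration only.
first_names = [("Sasha", "male"), ("Sasha", "female"), ("Sasha", "male"),
               ("Oleg", "male"), ("Masha", "female")]
p_male = male_probabilities(first_names)
```

The same function can be applied separately to last names, giving the two probabilities used as features.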


Results

Model                   Accuracy        Precision       Recall          F-measure
rule-based baseline     0.638           0.995           0.633           0.774
endings                 0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                 0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                   0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams         0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts           0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts   0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here "endings" denotes 4 Russian female endings, "3-grams" the 1,000 most frequent 3-grams, and "dicts" the name/surname dictionary. The table presents precision, recall, and F-measure of the female class.
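As a quick consistency check on the table, the F-measure column can be recomputed from precision and recall. The sketch assumes the reported F-measure is the balanced F1 score, which the numbers below agree with.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows from the table: (precision, recall, reported F-measure).
rows = {"rule-based baseline": (0.995, 0.633, 0.774),
        "endings": (0.921, 0.784, 0.847),
        "dicts": (0.992, 0.925, 0.957)}
for model_name, (p, r, reported) in rows.items():
    print(model_name, round(f_measure(p, r), 3), reported)
```

The baseline row shows the trade-off clearly: a precision of 0.995 with a recall of only 0.633 yields an F-measure of 0.774.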


Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.

4 User Language Detection

Problem

Motivation

• Goal: to detect Russian-speaking users
• The Cyrillic alphabet is also used by Ukrainian, Belorussian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

• Which method is the best for the Russian language?
• How to adapt it to the FB profile?

Contributions

• Comparison of Russian-enabled language detection modules
• A technique for identification of Russian-speaking users


Method

• input: a FB user profile
• output: is Russian-speaker (or a set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
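Character trigrams like these are the basis of many language identifiers. The sketch below shows one simple variant, assumed here for illustration and not taken from any of the modules compared in the talk: build a profile of the most frequent trigrams from a sample of a language, then score a text by the share of its trigrams found in that profile. The one-sentence Russian sample is a toy stand-in for a real corpus.

```python
from collections import Counter


def trigram_counts(text):
    """Count character 3-grams; whitespace is normalized and kept as a boundary."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def top_trigrams(sample, k=300):
    """Language profile: the k most frequent trigrams of a sample text."""
    return {t for t, _ in trigram_counts(sample).most_common(k)}


def overlap_score(text, profile):
    """Share of trigram occurrences in text that belong to the profile."""
    counts = trigram_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for t, c in counts.items() if t in profile) / total


# Toy profile, for illustration only.
ru_profile = top_trigrams("привет как дела что нового на работе")
```

With this toy profile, `overlap_score("как дела", ru_profile)` is 1.0 and `overlap_score("hello world", ru_profile)` is 0.0. Separating Russian from other Cyrillic-script languages needs per-language profiles built from much larger samples.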


Existing modules for language identification

• langid.py: https://github.com/saffsd/langid.py
  • advanced n-gram selection
• chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector
• guess-language: https://code.google.com/p/guess-language
• Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language
  • $20 per 1M characters
• Yandex Translate API: http://api.yandex.ru/translate
  • free of charge, 1M characters per day (as of September 2013)
• Many more, e.g. language-detection for Java


DBpedia Dataset


Accuracy of Different Language Detection Modules


Facebook Dataset: Method

• Profile text: posts + comments + user names − Latin symbols
• Profile text length: 3367 ± 17540
• Russian-speakers: P(ru) > 0.95
• Core Russian-speakers:
  • P(ru) > 0.95
  • Cyrillic symbols >= 20
  • locale is ru_RU
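The "core Russian-speaker" rule can be sketched as a small predicate. Two assumptions are made here: the three conditions are combined with a logical AND, and `p_ru` is supplied by an external language detector (the function and parameter names are hypothetical).

```python
import re

# Basic Cyrillic Unicode block.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def is_core_russian_speaker(profile_text, p_ru, locale,
                            p_threshold=0.95, min_cyrillic=20):
    """p_ru is assumed to be P(ru) from an external language detector."""
    n_cyrillic = len(CYRILLIC.findall(profile_text))
    return p_ru > p_threshold and n_cyrillic >= min_cyrillic and locale == "ru_RU"


text = "Привет, как дела? Сегодня отличная погода"
print(is_core_russian_speaker(text, p_ru=0.99, locale="ru_RU"))
```

Dropping the locale and character-count checks gives the weaker "Russian-speaker" criterion, which relies on P(ru) alone.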


Facebook Dataset: Results

• 9,906,524 public FB profiles (>= 50 Cyrillic characters)
• 8,687,915 (88%) Russian-speaking users
• 3,190,813 (32%) core Russian-speaking public Facebook users
• 5,365,691 (54%) profiles with no profile text (<= 200 characters)

5 User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

• input: some SN data representing a user
• output: list of user interests

Motivation

• Advertisement targeting, user segmentation, etc.
• Recommendations of content and friends
• Customization of user experience


Data: FB and VK groups

Text corpus

• 41 million VK groups
• 11 million FB publics
• 15 million FB groups

Data format

• Title and/or description
• List of members
• Number of comments, likes, and posts by a member


Data: VK groups


253 interests detected by our system


Method

1 Create a text index of groups
2 Create a keyword list for each of 253 interests
3 KW classifier
  • Retrieve top k groups matched by a set of interest keywords
  • Rank by TF-IDF
  • Associate group's interests with its users
  • A group may have multiple interests
4 ML classifier
  • Use top k groups as training data
  • BOW features
  • Keyword features
  • Linear models: L2 LR, Linear SVM, NB
  • Classify all groups
  • A group may have up to three top interests
  • Associate group's interests with its users
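The keyword (KW) classifier step can be sketched as a TF-IDF ranking of group texts against an interest's keyword list. The code below is a minimal in-memory illustration with made-up groups and keywords; the raw-count TF and the log(N/df) IDF are assumptions, and a real system would query a text index instead of scanning all groups.

```python
import math
from collections import Counter


def rank_groups(groups, keywords, k=2):
    """Return ids of the top-k groups by summed TF-IDF of the interest keywords."""
    docs = {gid: Counter(text.lower().split()) for gid, text in groups.items()}
    n_docs = len(docs)
    scores = {}
    for gid, tf in docs.items():
        score = 0.0
        for kw in keywords:
            df = sum(1 for d in docs.values() if kw in d)  # document frequency
            if df and tf[kw]:
                score += tf[kw] * math.log(n_docs / df)
        if score > 0:
            scores[gid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical group titles/descriptions and keyword list for one interest.
groups = {"g1": "football fans club football news",
          "g2": "cooking recipes and football snacks",
          "g3": "classical music lovers"}
top = rank_groups(groups, ["football", "fans"])
```

Here `top` is `["g1", "g2"]`: g1 mentions both keywords, g2 only one, and g3 is not retrieved at all. The retrieved groups can then serve as training data for the ML classifier step.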


Association of group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

• l: the number of post likes
• c_s: the number of short comments
• c_l: the number of long comments
• r: the number of reposts

The association score of a user and an interest depends on the engagement in a group and on the number of groups:

$$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

• e_vk, e_fb: engagement in a VK/FB interest
• g_vk, g_fb: number of groups a user has in VK/FB
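The two formulas translate directly into code. All weight values below (the w's, α, and β) are made up for the example; the slides do not state them.

```python
def engagement(likes, short_comments, long_comments, reposts,
               w_like=1.0, w_scomm=2.0, w_lcomm=3.0, w_repost=4.0):
    """e = w_like*l + w_scomm*c_s + w_lcomm*c_l + w_repost*r (weights are hypothetical)."""
    return (w_like * likes + w_scomm * short_comments
            + w_lcomm * long_comments + w_repost * reposts)


def association(e_fb, g_fb, e_vk, g_vk, alpha=0.5, beta=0.5):
    """all = alpha*e_fb*g_fb + beta*e_vk*g_vk (alpha and beta are hypothetical)."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk


e_vk = engagement(likes=10, short_comments=2, long_comments=1, reposts=1)  # 10 + 4 + 3 + 4
e_fb = engagement(likes=4, short_comments=0, long_comments=0, reposts=0)
score = association(e_fb, g_fb=2, e_vk=e_vk, g_vk=3)  # 0.5*4*2 + 0.5*21*3
```

With these illustrative weights, `e_vk` is 21.0, `e_fb` is 4.0, and the association score is 35.5.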


Results


Results per category: the best and the worst


Top 30 interests on FB and VK


Intersection of the top 30 interests on FB and VK


Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

• input: a user profile of one social network
• output: profile of the same person in another social network
• immediate applications in marketing, search, security, etc.

Contribution

• user identity resolution approach
• precision of 0.98 and recall of 0.54
• the method is computationally efficient and easily parallelizable


Dataset

                                   VK            Facebook
Number of users in our dataset     89,561,085    2,903,144
Number of users in Russia (1)      100,000,000   13,000,000
User overlap                       29            88

Training set: 92,488 matched FB-VK profiles

(1) According to comScore and http://vk.com/about


Profile matching algorithm

1 Candidate generation. For each VK profile, we retrieve a set of FB profiles with similar first and second names.
2 Candidate ranking. The candidates are ranked according to the similarity of their friends.
3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.


Candidate generation

• Retrieve FB users with names similar to the input VK profile
• Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
• Levenshtein automata for fuzzy matching between a VK user name and all FB user names
• Automatically extracted dictionary of name synonyms
  • “Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
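The name-similarity test can be sketched as follows. Note the substitution: this sketch uses a plain dynamic-programming edit distance and a linear scan over candidates, not Levenshtein automata, which are what make the lookup fast at scale. The small synonym dictionary is hypothetical and only mirrors the example on the slide.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


# Hypothetical synonym dictionary: diminutive -> canonical form.
SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance <= max_dist, after synonym lookup."""
    a, b = a.lower(), b.lower()
    a, b = SYNONYMS.get(a, a), SYNONYMS.get(b, b)
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_dist


def candidates(vk_name, fb_names):
    """Linear scan over FB names; an index or automaton would replace this at scale."""
    return [n for n in fb_names if names_similar(vk_name, n)]
```

For example, `candidates("Natalia", ["Natalya", "Nadezhda", "Oleg"])` keeps only "Natalya".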


Candidate ranking

• The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles
• Two friends are considered to be similar if:
  • the first two letters of their last names match
  • the similarities sim_s between their first and last names are greater than the thresholds α and β

$$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

• The contribution of each friend to the similarity sim_p of two profiles p_vk and p_fb is the inverse of the expected name frequency:

$$sim_p(p_{vk}, p_{fb}) = \sum_{j \,:\, sim_s(s^f_i, s^f_j) > \alpha \,\wedge\, sim_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$$

Here s^f_i and s^s_i are the first and second names of a VK profile, respectively, while s^f_j and s^s_j refer to a FB profile.
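The ranking score can be sketched in a few lines. Several points are assumptions rather than facts from the slides: |s| in the weighting term is read as the number of users bearing that name, N as the total number of users, and the sum is taken over all pairs of VK and FB friends. The default thresholds follow the values reported later in the results (α = 0.8, β = 0.6).

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n_users,
          alpha=0.8, beta=0.6):
    """Friends are (first, last) tuples; *_freq map a name to its number of bearers."""
    total = 0.0
    for vk_first, vk_last in vk_friends:
        for fb_first, fb_last in fb_friends:
            if (vk_last[:2] == fb_last[:2]
                    and sim_s(vk_first, fb_first) > alpha
                    and sim_s(vk_last, fb_last) > beta):
                # Rare names contribute up to 1; frequent names contribute less.
                total += min(1.0, n_users / (first_freq[fb_first] * last_freq[fb_last]))
    return total


# Hypothetical friend lists and name frequencies.
vk = [("sergei", "objedkov"), ("ivan", "petrov")]
fb = [("sergey", "objedkov"), ("oleg", "sidorov")]
score = sim_p(vk, fb, {"sergey": 50000, "oleg": 500},
              {"objedkov": 10, "sidorov": 2000}, n_users=100000)
```

In this example only the pair Sergei/Sergey Objedkov matches, and it contributes min(1, 100000 / (50000 · 10)) = 0.2 to the score.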


Best candidate selection

• FB candidates are ranked according to their similarity sim_p to an input profile p_vk
• The best candidate p_fb should pass two thresholds to match:
  • its score should be higher than the score threshold γ:

$$sim_p(p_{vk}, p_{fb}) > \gamma$$

  • it is either the only candidate, or the score ratio between it and the next best candidate p'_fb should be higher than the ratio threshold δ:

$$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$$
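The two-threshold selection rule can be written as a short function. The default values γ = 3 and δ = 5 are the ones reported in the results of the talk; treating a runner-up with a zero score as an automatic pass of the ratio test is an assumption made here.

```python
def select_best(scored_candidates, gamma=3.0, delta=5.0):
    """scored_candidates: list of (fb_profile_id, sim_p score). Returns an id or None."""
    ranked = sorted(scored_candidates, key=lambda c: c[1], reverse=True)
    if not ranked:
        return None
    best_id, best_score = ranked[0]
    if best_score <= gamma:          # score threshold
        return None
    if len(ranked) > 1:              # ratio threshold against the runner-up
        runner_up = ranked[1][1]
        if runner_up > 0 and best_score / runner_up <= delta:
            return None
    return best_id
```

For instance, `select_best([("a", 12.0), ("b", 2.0)])` returns "a" (ratio 6 > 5), while `select_best([("a", 12.0), ("b", 4.0)])` returns `None` because the best candidate is not sufficiently ahead of the second one. Raising γ and δ trades recall for precision.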


Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.


Results: matching VK and FB profiles

First name threshold α        0.8
Second name threshold β       0.6
Profile score threshold γ     3
Profile ratio threshold δ     5
Number of matched profiles    644,334 (22%)
Expected precision            0.98
Expected recall               0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

• User Age & Region Detection
  • "Tell me who your friends are and I will tell you who you are"
  • Most frequent age/region of friends
  • Reject users with high variation of age/region among friends
  • Up to 85-90% precision
• User Income Detection
  • Transfer learning: the target variable is not present in SNs
  • Training a model on a set of users with known income
  • Applying the model to the social network profiles
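The friends-based age and region detection can be sketched as a majority vote with a rejection rule. Using the share of the most frequent value as the measure of variation, and the 0.6 threshold, are assumptions made for this illustration.

```python
from collections import Counter


def infer_from_friends(friend_values, min_share=0.6):
    """Most frequent value among friends, or None if friends disagree too much."""
    values = [v for v in friend_values if v is not None]
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) >= min_share else None


region = infer_from_friends(["Kharkiv", "Kharkiv", "Kyiv", "Kharkiv", None])
```

Here three of the four friends with a known region are from Kharkiv, so the user is assigned "Kharkiv"; a user whose friends are spread evenly over many regions is rejected.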


Thank you! Questions?


References

Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In ICWSM.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., pages 18–21.

Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers' age and gender. In Third International AAAI Conference on Weblogs and Social Media.

Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Workshop.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics.

Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.

Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.

  • Social Network Data
  • Social Network Analysis
  • User Gender Detection
  • User Language Detection
  • User Interests Detection
  • VK-FB User Matching
  • Other SNA Tasks
Page 2: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Social networks from the usersrsquos standpoint

Facebook (FB) and VKontakte (VK)

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Social networks from the data minerrsquos standpoint

Facebook (FB) and VKontakte (VK)

Profiles a set of user attributescategorical variables (region city profession etc)integer variables (age graduation year etc)text variables (name surname etc)

Network a graph that relates usersfriendship graphfollowers graphcommenting graph etc

Textspostscommentsgroup titles and descriptions

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Gathering of VK and FB data

Big Data VK worth tens or even hundreds of TBDecide what do you need (posts profiles etc)Download

APIScraping

Download limits and API limitations are specific for eachnetworkParallelization is very practical especially horizontal one

Amazon EC2 Distributed Message Queues

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Storing VK and FB data

Again Big DataNoSQL solutions are helpfulRaw data Amazon S3For analysis HDFSEfficient retrieval Elastic Search

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Social Network Analysis

Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches

What scientific communities analyze social networks

60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Technologies for analysis of social networks

Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically

Applying the model should be scaled horizontally

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Andrey Teterin

Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc

By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]

By full name Burger et al [2011] Panchenko and Teterin[2014]

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Online demo

httpresearchdigsolabcomgender

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Training Data

100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Training Data

Figure Name-surname co-occurrences rows and columns are sorted byfrequency

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Character endings of Russian names

72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings

Conclusion

Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Character endings of Russian names

Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin

Table Most discriminative and frequent two character endings ofRussian names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Gender Detection Method

input a string representing a name of a personoutput gender (male or female)binary classification task

Features

endingscharacter n-gramsdictionary of malefemale names and surnames

Model

L2-regularized Logistic Regression

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Features

Character n-grams

males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)

Dictionaries of first and last names

probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Model                   Accuracy        Precision       Recall          F-measure
rule-based baseline     0.638           0.995           0.633           0.774
endings                 0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                 0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                   0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams         0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts           0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts   0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here endings: 4 Russian female endings; trigrams: 1,000 most frequent 3-grams; dictionary: name/surname dictionary. The table presents precision, recall and F-measure of the female class.
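
As a quick sanity check, the reported F-measure values are consistent with the balanced F-measure (the harmonic mean of precision and recall). The helper name below is ours.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.995, 0.633), 3))  # rule-based baseline: 0.774
print(round(f_measure(0.992, 0.925), 3))  # dicts: 0.957
```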

Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.

4 User Language Detection

Problem

Motivation

  • Goal: to detect Russian-speaking users
  • The Cyrillic alphabet is also used by Ukrainian, Belorussian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

  • Which method is the best for the Russian language?
  • How to adapt it to the FB profile?

Contributions

  • Comparison of Russian-enabled language detection modules
  • A technique for identification of Russian-speaking users

Method

  • Input: a FB user profile
  • Output: whether the user is a Russian speaker (or the set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка ать ото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ii ист о в ост тра те ели ере кот льн ник нти о с
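
To show how character trigrams can signal a language, here is a deliberately naive sketch that measures the share of a text's trigrams found in a small set. The set contains only the unambiguous three-letter items from the list above; the scoring function is our own illustration, not the method used by the detectors compared in the talk, which rely on trained models.

```python
# Three-letter trigrams copied from the slide (space-containing ones are omitted
# because the extraction does not preserve where the spaces were).
COMMON_RU_TRIGRAMS = {
    "про", "ова", "тел", "тор", "сти", "стр", "ват", "иче", "сто", "вер",
    "ист", "ост", "тра", "кот", "льн", "ник", "нти", "ать", "ото", "ели", "ере",
}

def char_trigrams(text):
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

def russian_trigram_share(text):
    """Fraction of the text's character trigrams that are in the common set."""
    grams = char_trigrams(text)
    if not grams:
        return 0.0
    return sum(g in COMMON_RU_TRIGRAMS for g in grams) / len(grams)

ru_share = russian_trigram_share("который настроил простой проект")  # Russian
uk_share = russian_trigram_share("який налаштував цей їжак")          # Ukrainian
print(ru_share, uk_share)
```

A score like this only separates languages on longer texts and with a much larger trigram inventory; it is meant to make the idea concrete, nothing more.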

Existing modules for language identification

  • langid.py: https://github.com/saffsd/langid.py
    Advanced n-gram selection
  • chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector
  • guess-language: https://code.google.com/p/guess-language
  • Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language
    $20 per 1M characters
  • Yandex Translate API: http://api.yandex.ru/translate
    Free of charge, 1M characters per day (as of September 2013)
  • Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

  • Profile text: posts + comments + user names - Latin symbols
  • Profile text length: 3,367 ± 17,540
  • Russian-speakers: P(ru) > 0.95
  • Core Russian-speakers:
      • P(ru) > 0.95
      • Cyrillic symbols >= 20
      • locale is ru_RU
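
The two user classes can be written as simple predicates. This is a sketch under stated assumptions: the `Profile` fields are hypothetical, the three "core" conditions are combined with a logical AND, and the Cyrillic threshold of 20 is treated as a character count, since the slide does not say whether it is a count or a percentage.

```python
from dataclasses import dataclass

@dataclass
class Profile:          # hypothetical container, not a Facebook API object
    text: str           # posts + comments + user names, Latin symbols removed
    p_ru: float         # P(ru) returned by some language detector
    locale: str         # e.g. "ru_RU"

def cyrillic_count(text):
    # Counts characters in the basic Cyrillic Unicode block (U+0400..U+04FF).
    return sum("\u0400" <= ch <= "\u04ff" for ch in text)

def is_russian_speaker(profile, p_threshold=0.95):
    return profile.p_ru > p_threshold

def is_core_russian_speaker(profile, p_threshold=0.95, min_cyrillic=20):
    return (
        is_russian_speaker(profile, p_threshold)
        and cyrillic_count(profile.text) >= min_cyrillic
        and profile.locale == "ru_RU"
    )

core = Profile("Привет, как дела? Сегодня хорошая погода", 0.99, "ru_RU")
non_core = Profile("Привет, как дела? Сегодня хорошая погода", 0.99, "en_US")
print(is_core_russian_speaker(core), is_core_russian_speaker(non_core))
```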

Facebook Dataset: Results

  • 9,906,524 public FB profiles (>= 50 cyr characters)
  • 8,687,915 (88%) Russian-speaking users
  • 3,190,813 (32%) core Russian-speaking public Facebook users
  • 5,365,691 (54%) of profiles with no profile text (<= 200 characters)

5 User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

  • Input: some SN data representing a user
  • Output: list of user interests

Motivation

  • Advertisement targeting, user segmentation, etc.
  • Recommendations of content and friends
  • Customization of user experience

Data: FB and VK groups

Text corpus

  • 41 million VK groups
  • 11 million FB publics
  • 15 million FB groups

Data format

  • Title and/or description
  • List of members
  • Number of comments, likes, posts by a member

Data: VK groups

253 interests detected by our system

Method

1. Create a text index of groups
2. Create a keyword list for each of the 253 interests
3. KW classifier
  • Retrieve the top k groups matching a set of interest keywords
  • Rank by TF-IDF
  • Associate group's interests with its users
  • A group may have multiple interests
4. ML classifier
  • Use the top k groups as training data
  • BOW features
  • Keyword features
  • Linear models: L2 LR, Linear SVM, NB
  • Classify all groups
  • A group may have up to three top interests
  • Associate group's interests with its users
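
Step 3 (the keyword classifier) can be sketched as TF-IDF ranking over group texts. The group texts, the keyword lists and the exact scoring (sum of term frequency times log inverse document frequency) are assumptions for illustration; the talk itself mentions a text index and Elastic Search for retrieval, which would take the place of the in-memory index here.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(groups):
    """groups: {group_id: title/description}. Returns per-group term counts and IDF."""
    tf = {gid: Counter(tokenize(text)) for gid, text in groups.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    idf = {term: math.log(len(groups) / df[term]) for term in df}
    return tf, idf

def top_k_groups(tf, idf, keywords, k=2):
    """Rank groups by summed TF-IDF of the interest keywords; keep matching groups only."""
    scores = {}
    for gid, counts in tf.items():
        score = sum(counts[kw] * idf.get(kw, 0.0) for kw in keywords)
        if score > 0:
            scores[gid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]

GROUPS = {  # hypothetical toy data
    "g1": "football fans club football news",
    "g2": "cooking recipes and kitchen tips",
    "g3": "football and hockey tickets",
    "g4": "travel photos",
}
INTEREST_KEYWORDS = {"sport": ["football", "hockey"], "food": ["cooking", "recipes"]}

tf, idf = build_index(GROUPS)
for interest, keywords in INTEREST_KEYWORDS.items():
    print(interest, top_k_groups(tf, idf, keywords))
# sport ['g3', 'g1']
# food ['g2']
```

The groups retrieved this way can then serve as training data for the ML classifier of step 4.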

Association of group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

  • $l$: the number of post likes
  • $c_s$: the number of short comments
  • $c_l$: the number of long comments
  • $r$: the number of reposts

The association score of a user and an interest depends on the engagement in a group and on the number of groups:

$$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

  • $e_{vk}$, $e_{fb}$: engagement in a VK/FB interest
  • $g_{vk}$, $g_{fb}$: number of groups a user has in VK/FB
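
A worked example of the two scores, with made-up weights. The slide gives only the form of the formulas, so every numeric value below (the activity weights, alpha and beta) is hypothetical.

```python
# Hypothetical weights; the slides do not report the values that were used.
WEIGHTS = {"like": 1.0, "short_comment": 2.0, "long_comment": 4.0, "repost": 3.0}

def engagement(likes, short_comments, long_comments, reposts, w=WEIGHTS):
    """e = w_like*l + w_scomm*c_s + w_lcomm*c_l + w_repost*r"""
    return (w["like"] * likes
            + w["short_comment"] * short_comments
            + w["long_comment"] * long_comments
            + w["repost"] * reposts)

def association(e_fb, g_fb, e_vk, g_vk, alpha=1.0, beta=1.0):
    """all = alpha*e_fb*g_fb + beta*e_vk*g_vk"""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk

e_vk = engagement(likes=10, short_comments=2, long_comments=1, reposts=0)  # 18.0
e_fb = engagement(likes=3, short_comments=0, long_comments=0, reposts=1)   # 6.0
print(association(e_fb, g_fb=1, e_vk=e_vk, g_vk=2))                          # 42.0
```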

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • Input: a user profile of one social network
  • Output: profile of the same person in another social network
  • Immediate applications in marketing, search, security, etc.

Contribution

  • User identity resolution approach
  • Precision of 0.98 and recall of 0.54
  • The method is computationally efficient and easily parallelizable

Dataset

                                   VK            Facebook
Number of users in our dataset     89,561,085    2,903,144
Number of users in Russia [1]      100,000,000   13,000,000
User overlap                       29            88

Training set: 92,488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Profile matching algorithm

1. Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.

2. Candidate ranking. The candidates are ranked according to similarity of their friends.

3. Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if the first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein Automata for fuzzy match between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms: "Alexander", "Sasha", "Sanya", "Sanek", etc.
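
The name-similarity rule can be sketched as follows. Note the substitutions: a plain dynamic-programming edit distance with a linear scan stands in for the Levenshtein automata mentioned above, the synonym dictionary is a three-entry toy, and applying the first-letter check after synonym normalization is our assumption.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Toy synonym dictionary (hypothetical; the real one was extracted automatically).
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}

def canonical(name):
    name = name.lower()
    return NAME_SYNONYMS.get(name, name)

def names_similar(a, b, max_dist=2):
    a, b = canonical(a), canonical(b)
    return bool(a) and bool(b) and a[0] == b[0] and levenshtein(a, b) <= max_dist

def generate_candidates(vk_profile, fb_profiles):
    """Profiles are (first_name, last_name) tuples."""
    first, last = vk_profile
    return [p for p in fb_profiles
            if names_similar(first, p[0]) and names_similar(last, p[1])]

FB_PROFILES = [("Alexander", "Panchenko"), ("Aleksander", "Panchenco"),
               ("Sasha", "Panchenko"), ("Oleg", "Panchenko")]
print(generate_candidates(("Sanya", "Panchenko"), FB_PROFILES))
# [('Alexander', 'Panchenko'), ('Aleksander', 'Panchenco'), ('Sasha', 'Panchenko')]
```

A linear scan is quadratic over two large networks; an automaton or an index over the FB names is what makes the fuzzy lookup tractable at that scale.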

Candidate ranking

  • The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles.
  • Two friends are considered to be similar if:
      • the first two letters of their last names match;
      • the similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$, where

$$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

  • The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected name frequency:

$$sim_p(p_{vk}, p_{fb}) = \sum_{j:\ sim_s(s^f_i, s^f_j) > \alpha \,\wedge\, sim_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$$

Here $s^f_i$ and $s^s_i$ are the first and second names of a VK profile, respectively, while $s^f_j$ and $s^s_j$ refer to a FB profile.
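
The ranking score can be sketched directly from the two formulas. Several readings are assumptions on our part: $|s^f_j|$ and $|s^s_j|$ are taken to be the frequencies of the first and last name in the dataset, $N$ the total number of profiles, and the sum is taken over all pairs of a VK friend and a FB friend. The default thresholds 0.8 and 0.6 are the values reported later in the talk.

```python
from collections import Counter

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sim_s(a, b):
    """sim_s = 1 - lev(a, b) / max(|a|, |b|)"""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def friends_similar(vk_friend, fb_friend, alpha=0.8, beta=0.6):
    (f_i, s_i), (f_j, s_j) = vk_friend, fb_friend
    return (s_i[:2] == s_j[:2]              # first two letters of last names match
            and sim_s(f_i, f_j) > alpha     # first names similar enough
            and sim_s(s_i, s_j) > beta)     # last names similar enough

def sim_p(vk_friends, fb_friends, first_freq, last_freq, n_profiles,
          alpha=0.8, beta=0.6):
    """Sum over similar friend pairs of min(1, N / (freq(first) * freq(last)))."""
    score = 0.0
    for vk_friend in vk_friends:
        for fb_friend in fb_friends:
            if friends_similar(vk_friend, fb_friend, alpha, beta):
                f_j, s_j = fb_friend
                score += min(1.0, n_profiles / (first_freq[f_j] * last_freq[s_j]))
    return score

# Hypothetical friend lists and name frequencies.
vk_friends = [("ivan", "petrov"), ("anna", "sidorova")]
fb_friends = [("ivan", "petrov"), ("oleg", "kuznetsov")]
first_freq = Counter({"ivan": 1000, "oleg": 500})
last_freq = Counter({"petrov": 200, "kuznetsov": 300})
print(sim_p(vk_friends, fb_friends, first_freq, last_freq, n_profiles=100_000))  # 0.5
```

The cap at 1 keeps rare names from dominating, while common names such as the "ivan petrov" pair above contribute less than one full point.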

Best candidate selection

FB candidates are ranked according to their similarity $sim_p$ to an input profile $p_{vk}$.

The best candidate $p_{fb}$ should pass two thresholds to match:

  • its score should be higher than the score threshold $\gamma$:

$$sim_p(p_{vk}, p_{fb}) > \gamma$$

  • it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$$
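
The two-threshold decision can be sketched as a small function. The candidate representation (a list of id and score pairs) and the treatment of a zero runner-up score are assumptions; the defaults for gamma and delta are the values reported in the results of the talk.

```python
def select_best(candidates, gamma=3.0, delta=5.0):
    """candidates: list of (fb_profile_id, sim_p score). Returns the match or None."""
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score <= gamma:
        return None                      # fails the score threshold
    if len(ranked) > 1:
        runner_up = ranked[1][1]
        # A zero runner-up score makes the ratio unbounded, so the test passes.
        if runner_up > 0 and best_score / runner_up <= delta:
            return None                  # fails the ratio threshold
    return best_id

print(select_best([("fb1", 12.0), ("fb2", 2.0)]))  # fb1 (ratio 6 > 5)
print(select_best([("fb1", 12.0), ("fb2", 4.0)]))  # None (ratio 3 is not above 5)
print(select_best([("fb1", 2.5)]))                 # None (score not above 3)
print(select_best([("fb1", 4.0)]))                 # fb1 (only candidate)
```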

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α       0.8
Second name threshold β      0.6
Profile score threshold γ    3
Profile ratio threshold δ    5
Number of matched profiles   644,334 (22%)
Expected precision           0.98
Expected recall              0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection

  • Tell me who your friends are and I will tell you who you are
  • Most frequent age/region of friends
  • Reject users with high variation of age/region among friends
  • Up to 85-90% precision

User Income Detection

  • Transfer learning: the target variable is not present in SNs
  • Training a model on a set of users with known income
  • Applying the model to the social network profiles
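
The age heuristic (most frequent age among friends, with rejection of noisy cases) might look like the sketch below. The rejection criterion (population standard deviation above `max_std`) and both thresholds are hypothetical; the slide only says that users with high variation among friends are rejected.

```python
from collections import Counter
from statistics import pstdev

def infer_age(friend_ages, max_std=5.0, min_friends=3):
    """Most frequent friend age, or None when the evidence is too sparse or noisy."""
    if len(friend_ages) < min_friends or pstdev(friend_ages) > max_std:
        return None
    return Counter(friend_ages).most_common(1)[0][0]

print(infer_age([24, 25, 25, 26, 25, 27]))  # 25
print(infer_age([18, 25, 40, 61, 33]))      # None (ages vary too much)
```

The same pattern applies to regions, with a categorical dispersion measure in place of the standard deviation.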

Thank you! Questions?

Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of twitter users from neighbors. In ICWSM.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301-1309. Association for Computational Linguistics.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of twitter users in non-english contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., pages 18-21.

Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers' age and gender. In Third International AAAI Conference on Weblogs and Social Media.

Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401-412.

Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Workshop.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207-217. Association for Computational Linguistics.

Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd international workshop on Search and mining user-generated contents, pages 37-44. ACM.

Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, pages 37-44. ACM.

Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin

Table Most discriminative and frequent two character endings ofRussian names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Gender Detection Method

input a string representing a name of a personoutput gender (male or female)binary classification task

Features

endingscharacter n-gramsdictionary of malefemale names and surnames

Model

L2-regularized Logistic Regression

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Features

Character n-grams

males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)

Dictionaries of first and last names

probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002

Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Motivation

Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc

Research Questions

Which method is the best for Russian languageHow to adopt it to the FB profile

Contributions

Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Existing modules for language identification

langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection

chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector

guess-languagehttpscodegooglecompguess-language

Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters

Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)

Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

DBpedia Dataset

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Accuracy of Different Language Detection Modules

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Method

Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers

P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Results

9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

input some SN data representing a useroutput list of user interests

Motivation

Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data FB and VK groups

Text corpus

41 million of VK groups11 million of FB publics15 million of FB groups

Data format

Title andor descriptionList of membersNumber of comments likes posts by a member

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data VK groups

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

253 interests detected by our system

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier

Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests

4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Association of grouprsquos interests with its users

Engagement of a person into an interest category isproportional to the activity of the person in groups of this category

e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r

l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts

Association score of a user and an interest depends onengagement in a group and on the number of groups

all asymp α middot efb middot gfb + β middot evk middot gvk

evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results per category the best and the worst

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Intersection of the top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Interests co-occurrences

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Segei Objedkov

Motivation

input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc

Contribution

user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Dataset

VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88

training set 92488 matched FB-VK profiles

1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Profile matching algorithm

1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names

2 Candidate ranking The candidates are ranked according tosimilarity of their friends

3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate generation

  • Retrieve FB users with names similar to those of the input VK profile
  • Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein automata for fuzzy matching between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms
      • "Alexander", "Sasha", "Sanya", "Sanek", etc.
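
The name-similarity rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it scans the FB names linearly with a plain dynamic-programming edit distance instead of the Levenshtein automata the slides mention, and the `NAME_SYNONYMS` dictionary and all function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Name = Tuple[str, str]  # (first name, second name)

# Hypothetical synonym dictionary: maps a name variant to a canonical form.
# The slides mention such a dictionary but do not show its contents.
NAME_SYNONYMS: Dict[str, str] = {
    "sasha": "alexander",
    "sanya": "alexander",
    "sanek": "alexander",
}


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def canonical(name: str) -> str:
    """Lower-case a name and replace a known variant by its canonical form."""
    name = name.lower()
    return NAME_SYNONYMS.get(name, name)


def names_similar(a: str, b: str, max_distance: int = 2) -> bool:
    """Same first letter and edit distance of at most max_distance."""
    a, b = canonical(a), canonical(b)
    if not a or not b or a[0] != b[0]:
        return False
    return levenshtein(a, b) <= max_distance


def generate_candidates(vk: Name, fb_profiles: Iterable[Name]) -> List[Name]:
    """Keep FB profiles whose first and second names are both similar to the VK ones."""
    return [fb for fb in fb_profiles
            if names_similar(vk[0], fb[0]) and names_similar(vk[1], fb[1])]
```

A linear scan like this would be far too slow for millions of profiles; it only makes the matching rule concrete.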

Candidate ranking

  • The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles
  • Two friends are considered similar if
      • the first two letters of their last names match
      • the similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$, respectively

$$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

  • The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected frequency of the friend's name

$$sim_p(p_{vk}, p_{fb}) = \sum_{j \,:\, sim_s(s^f_i, s^f_j) > \alpha \,\wedge\, sim_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$$

Here $s^f_i$ and $s^s_i$ are the first and second names of a VK profile, respectively, while $s^f_j$ and $s^s_j$ refer to a FB profile.
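
A minimal sketch of the two similarity functions follows. The slide does not spell out what $N$ and $|s^f_j|$, $|s^s_j|$ stand for; the code assumes that $N$ is the total number of profiles and that $|s^f_j|$ and $|s^s_j|$ are the numbers of profiles carrying that first or second name, which matches the "inverse of expected frequency" reading. The count dictionaries, the function names and the rule that each FB friend is counted at most once are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Name = Tuple[str, str]  # (first name, second name), assumed lower-cased


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def sim_s(a: str, b: str) -> float:
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def friend_weight(friend: Name, first_counts: Dict[str, int],
                  second_counts: Dict[str, int], n_profiles: int) -> float:
    """min(1, N / (|s^f| * |s^s|)): rare names contribute more than common ones."""
    f = first_counts.get(friend[0], 1)    # unseen names are treated as rare
    s = second_counts.get(friend[1], 1)
    return min(1.0, n_profiles / (f * s))


def sim_p(vk_friends: List[Name], fb_friends: List[Name],
          first_counts: Dict[str, int], second_counts: Dict[str, int],
          n_profiles: int, alpha: float = 0.8, beta: float = 0.6) -> float:
    """Sum the weights of FB friends that have a similar friend on the VK side."""
    score = 0.0
    for fb in fb_friends:
        for vk in vk_friends:
            if (vk[1][:2] == fb[1][:2]
                    and sim_s(vk[0], fb[0]) > alpha
                    and sim_s(vk[1], fb[1]) > beta):
                score += friend_weight(fb, first_counts, second_counts, n_profiles)
                break  # count each FB friend at most once
    return score
```

The default values of `alpha` and `beta` are the thresholds reported on the results slide.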

Best candidate selection

  • FB candidates are ranked according to their similarity $sim_p$ to the input profile $p_{vk}$
  • The best candidate $p_{fb}$ must pass two thresholds to be accepted as a match
      • its score must be higher than the score threshold $\gamma$:

$$sim_p(p_{vk}, p_{fb}) > \gamma$$

      • it must either be the only candidate, or the ratio between its score and the score of the next best candidate $p'_{fb}$ must be higher than the ratio threshold $\delta$:

$$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$$
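
The two-threshold test can be written as a small function. This is a sketch, not the authors' code: the name `select_best`, the input format (a list of candidate id and score pairs) and the treatment of a zero-score runner-up are assumptions, while the default values of `gamma` and `delta` are those given on the results slide.

```python
from typing import List, Optional, Tuple


def select_best(candidates: List[Tuple[str, float]],
                gamma: float = 3.0, delta: float = 5.0) -> Optional[str]:
    """Return the id of the matching FB profile, or None if no candidate qualifies."""
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_id, best_score = ranked[0]
    # First threshold: the absolute score must exceed gamma.
    if best_score <= gamma:
        return None
    # A single candidate only needs to pass the score threshold.
    if len(ranked) == 1:
        return best_id
    next_score = ranked[1][1]
    # Second threshold: the best score must dominate the runner-up by a factor of delta.
    # Assumption: a runner-up with a zero score cannot contest the match.
    if next_score <= 0 or best_score / next_score > delta:
        return best_id
    return None
```

Returning `None` when the runner-up is too close trades recall for precision, which is in line with the reported 0.98 precision at 0.54 recall.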

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α         0.8
Second name threshold β        0.6
Profile score threshold γ      3
Profile ratio threshold δ      5
Number of matched profiles     644,334 (22%)
Expected precision             0.98
Expected recall                0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

  • User Age & Region Detection
      • "Tell me who your friends are and I will tell you who you are"
      • Most frequent age/region among friends
      • Reject users with a high variation of age/region among their friends
      • Up to 85-90% precision
  • User Income Detection
      • Transfer learning: the target variable is not present in the SNs
      • Train a model on a set of users with known income
      • Apply the model to the social network profiles
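
The age heuristic above (most frequent value among friends, with rejection of users whose friends vary too much) might look as follows. The thresholds `max_std` and `min_friends` are hypothetical; the slides give no concrete values.

```python
from collections import Counter
from statistics import pstdev
from typing import List, Optional


def infer_age(friend_ages: List[int], max_std: float = 5.0,
              min_friends: int = 3) -> Optional[int]:
    """Predict a user's age as the most frequent age among their friends.

    Returns None when there are too few friends or when their ages are
    too spread out for the prediction to be trusted.
    """
    if len(friend_ages) < min_friends:
        return None
    if pstdev(friend_ages) > max_std:
        return None  # reject: high variation of age among friends
    return Counter(friend_ages).most_common(1)[0][0]
```

The same pattern would apply to regions, with a categorical dispersion measure (for example the share of the most frequent region) in place of the standard deviation.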

Thank you! Questions?

References

Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In ICWSM.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., pages 18–21.

Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers' age and gender. In Third International AAAI Conference on Weblogs and Social Media.

Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Workshop.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics.

Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.

Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.


Social networks from the user's standpoint

Facebook (FB) and VKontakte (VK)

Social networks from the data miner's standpoint

Facebook (FB) and VKontakte (VK)

  • Profiles: a set of user attributes
      • categorical variables (region, city, profession, etc.)
      • integer variables (age, graduation year, etc.)
      • text variables (name, surname, etc.)
  • Network: a graph that relates users
      • friendship graph
      • followers graph
      • commenting graph, etc.
  • Texts
      • posts
      • comments
      • group titles and descriptions

Gathering of VK and FB data

  • Big Data: VK is worth tens or even hundreds of TB
  • Decide what you need (posts, profiles, etc.)
  • Download
      • API
      • Scraping
  • Download limits and API limitations are specific to each network
  • Parallelization is very practical, especially horizontal parallelization
      • Amazon EC2, distributed message queues

Storing VK and FB data

  • Again, Big Data
  • NoSQL solutions are helpful
  • Raw data: Amazon S3
  • For analysis: HDFS
  • Efficient retrieval: Elasticsearch


Social Network Analysis

  • Structure analysis: friendship graph, comments graph, etc.
  • Content analysis: profile attributes, posts, comments, etc.
  • Combined approaches

Which scientific communities analyze social networks?

  • 60s: the first structural methods
  • 00s: the online social network analysis boom
  • Social Network Analysis community (sociologists, statisticians, physicists)
  • Data and Graph Mining community
  • Natural Language Processing community

Technologies for analysis of social networks

  • Machine learning: hidden vs. observable user attributes
  • Training of the model can often be scaled vertically
  • Applying the model should be scaled horizontally

3 User Gender Detection

Problem

Joint work with Andrey Teterin.

  • Detect the gender of a user
      • to profile a user
      • user segmentation is helpful in search, advertisement, etc.
  • By text written by a user: Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009], Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al. [2010], Rangel and Rosso [2013], Al Zamal et al. [2012], and Liu et al. [2012]
  • By full name: Burger et al. [2011], Panchenko and Teterin [2014]

Online demo

http://research.digsolab.com/gender

Training Data

  • 100,000 full names of Facebook users with known gender
  • full name: first and last name of a user
  • gender: male or female
  • names written in both Cyrillic and Latin alphabets
  • "Alexander Ivanov", "Masha Sidorova", "Pavel Nikolenko", etc.

Figure: Name-surname co-occurrences; rows and columns are sorted by frequency.

Character endings of Russian names

  • 72% of first names have a typical male/female ending
  • 68% of surnames have a typical male/female ending
  • a typical male/female ending splits males from females with an error of less than 5%
  • the gender of ≥ 50% of first names is recognized with 8 endings
  • the gender of ≥ 50% of second names is recognized with 5 endings

Conclusion

A simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach.

Type          Ending      Gender    Error, %   Example
first name    na (на)     female    0.27       Ekaterina
first name    iya (ия)    female    0.32       Anastasiya
first name    ei (ей)     male      0.16       Sergei
first name    dr (др)     male      0.00       Alexandr
first name    ga (га)     male      4.94       Serega
first name    an (ан)     male      4.99       Ivan
first name    la (ла)     female    4.23       Luidmila
first name    ii (ий)     male      0.34       Yurii
second name   va (ва)     female    0.28       Morozova
second name   ov (ов)     male      0.21       Objedkov
second name   na (на)     female    2.22       Matyushina
second name   ev (ев)     male      0.44       Sergeev
second name   in (ин)     male      1.94       Teterin

Table: Most discriminative and frequent two-character endings of Russian names.
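
The endings in the table translate directly into a lookup-based rule. The sketch below is an illustration only: it works on Cyrillic spellings, the function name `gender_by_ending` is hypothetical, and names whose ending is not in the table stay unclassified, which is exactly the coverage gap the conclusion above points out.

```python
from typing import Dict, Optional

# Two-character Cyrillic endings taken from the table above.
ENDINGS: Dict[str, Dict[str, str]] = {
    "first": {
        "на": "female", "ия": "female", "ла": "female",
        "ей": "male", "др": "male", "га": "male", "ан": "male", "ий": "male",
    },
    "second": {
        "ва": "female", "на": "female",
        "ов": "male", "ев": "male", "ин": "male",
    },
}


def gender_by_ending(name: str, kind: str) -> Optional[str]:
    """Classify a first ('first') or second ('second') name by its last two letters.

    Returns 'male', 'female', or None when the ending is not in the table.
    """
    return ENDINGS[kind].get(name.lower()[-2:])
```

For example, a surname such as "Сидоренко" has the ending "ко", which the table does not cover, so the rule returns `None` for it.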

Gender Detection Method

  • input: a string representing the name of a person
  • output: gender (male or female)
  • a binary classification task

Features

  • endings
  • character n-grams
  • dictionary of male/female names and surnames

Model

  • L2-regularized logistic regression

Features

Character n-grams

  • males: Alexander Yaroskavski, Oleg Arbuzov
  • females: Alexandra Yaroskavskaya, Nayaliya Arbuzova
  • BUT: "Sidorenko", "Moroz" or "Bondar"
  • the two most common one-character endings: "a" and "ya" ("я")

Dictionaries of first and last names

  • the probability that a name belongs to the male gender: P(c = male | firstname), P(c = male | lastname)
  • 3,427 first names, 11,411 last names
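
A rough sketch of how the two feature families could be computed is given below. The boundary markers `^` and `$`, the feature naming scheme and the 0.5 default for names missing from the dictionaries are assumptions made for illustration. The resulting feature dictionary would then be fed to a linear classifier such as the L2-regularized logistic regression named on the method slide; that step is not shown here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def char_ngrams(name: str, n: int = 3) -> List[str]:
    """Character n-grams of a name, with ^ and $ marking its start and end."""
    padded = f"^{name.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def male_probability(samples: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Estimate P(c = male | name) from (name, gender) pairs by relative frequency."""
    total: Counter = Counter()
    male: Counter = Counter()
    for name, gender in samples:
        key = name.lower()
        total[key] += 1
        if gender == "male":
            male[key] += 1
    return {name: male[name] / total[name] for name in total}


def name_features(first: str, last: str, first_dict: Dict[str, float],
                  last_dict: Dict[str, float], n: int = 3) -> Dict[str, float]:
    """Combine n-gram counts with the dictionary probabilities of both names."""
    grams = Counter(char_ngrams(first, n) + char_ngrams(last, n))
    features = {f"ngram={gram}": float(count) for gram, count in grams.items()}
    # Names missing from a dictionary get an uninformative probability of 0.5.
    features["p_male_first"] = first_dict.get(first.lower(), 0.5)
    features["p_male_last"] = last_dict.get(last.lower(), 0.5)
    return features
```

The end marker lets a trigram such as `ov$` act as an ending feature, so n-grams cover the ending-based rule as a special case.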

Results

Model                    Accuracy        Precision       Recall          F-measure
rule-based baseline      0.638           0.995           0.633           0.774
endings                  0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                  0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                    0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams          0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts            0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts    0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here "endings" are 4 Russian female endings, "3-grams" are the 1,000 most frequent 3-grams, and "dicts" is the name/surname dictionary. The table presents precision, recall and F-measure of the female class.

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.

4 User Language Detection

Problem

Motivation

  • Goal: to detect Russian-speaking users
  • The Cyrillic alphabet is also used by Ukrainian, Belarusian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

  • Which method is the best for the Russian language?
  • How to adapt it to the FB profile?

Contributions

  • A comparison of Russian-enabled language detection modules
  • A technique for the identification of Russian-speaking users

Method

  • input: a FB user profile
  • output: whether the user is a Russian speaker (or the set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка ать ото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те ели ере кот льн ник нти о с
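
To give a feel for why character trigrams are informative, here is a toy scorer that measures what share of a text's trigrams belong to a small set of frequent Russian ones. The set `RU_TRIGRAMS` contains only the three-letter items readable in the list above, and the function names are hypothetical. Real detectors such as langid.py rely on trained models rather than a fixed list, so this is an illustration, not a substitute.

```python
from typing import List, Set

# The three-letter items from the list of common Russian trigrams above.
RU_TRIGRAMS: Set[str] = {
    "про", "ать", "ото", "ова", "тел", "тор", "сти", "стр", "ват", "иче",
    "сто", "вер", "ист", "ост", "тра", "ели", "ере", "кот", "льн", "ник", "нти",
}


def trigrams(text: str) -> List[str]:
    """All overlapping character trigrams of the lower-cased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def russian_trigram_share(text: str) -> float:
    """Fraction of the text's trigrams that are common Russian trigrams."""
    grams = trigrams(text)
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in RU_TRIGRAMS)
    return hits / len(grams)
```

Because related languages share the Cyrillic alphabet but differ in their frequent trigrams, a share computed this way separates them better than a plain alphabet check.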

Existing modules for language identification

  • langid.py: https://github.com/saffsd/langid.py
      • Advanced n-gram selection
  • chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector
  • guess-language: https://code.google.com/p/guess-language
  • Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language
      • $20 per 1M characters
  • Yandex Translate API: http://api.yandex.ru/translate
      • Free of charge: 1M characters per day (as of September 2013)
  • Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

  • Profile text: posts + comments + user names – Latin symbols
  • Profile text length: 3,367 ± 17,540
  • Russian speakers: P(ru) > 0.95
  • Core Russian speakers
      • P(ru) > 0.95, number of Cyrillic symbols >= 20
      • locale is ru_RU
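
The decision rule on this slide can be sketched as below, assuming that `p_ru` is the probability of Russian returned by some language detector and that the "core" conditions are meant to hold together. The function names and the three labels are illustrative.

```python
import re

# Basic Cyrillic Unicode block; extended Cyrillic blocks are ignored in this sketch.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def count_cyrillic(text: str) -> int:
    """Number of Cyrillic characters in the text."""
    return len(CYRILLIC.findall(text))


def classify_speaker(profile_text: str, p_ru: float, locale: str) -> str:
    """Label a profile as 'core', 'russian' or 'other'.

    p_ru is assumed to come from an external language detector.
    """
    if p_ru <= 0.95:
        return "other"
    if count_cyrillic(profile_text) >= 20 and locale == "ru_RU":
        return "core"
    return "russian"
```

The extra conditions for the core group guard against short texts, where a high detector probability is less trustworthy.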

Facebook Dataset Results

  • 9,906,524 public FB profiles (>= 50 Cyrillic characters)
  • 8,687,915 (88%) Russian-speaking users
  • 3,190,813 (32%) core Russian-speaking public Facebook users
  • 5,365,691 (54%) profiles with no profile text (<= 200 characters)

5 User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov.

  • input: some SN data representing a user
  • output: a list of user interests

Motivation

  • Advertisement targeting, user segmentation, etc.
  • Recommendation of content and friends
  • Customization of user experience

Data: FB and VK groups

Text corpus

  • 41 million VK groups
  • 11 million FB publics
  • 15 million FB groups

Data format

  • Title and/or description
  • List of members
  • Number of comments, likes, posts by a member

Data: VK groups

253 interests detected by our system

Method

1. Create a text index of groups
2. Create a keyword list for each of the 253 interests
3. KW classifier
      • Retrieve the top k groups matched by the set of interest keywords
      • Rank by TF-IDF
      • Associate the group's interests with its users
      • A group may have multiple interests
4. ML classifier
      • Use the top k groups as training data
      • BOW features
      • Keyword features
      • Linear models: L2 LR, linear SVM, NB
      • Classify all groups
      • A group may have up to three top interests
      • Associate the group's interests with its users
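
Step 3 (the keyword classifier) can be illustrated with a tiny in-memory TF-IDF ranker. This is a sketch under assumptions: whitespace tokenization, a smoothed `log(N / df) + 1` IDF and the function name `top_k_groups` are choices made here, not details from the slides, and a real system would query the text index built in step 1 instead.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case whitespace tokenization (an assumption, kept deliberately simple)."""
    return text.lower().split()


def top_k_groups(groups: Dict[str, str], keywords: List[str],
                 k: int = 2) -> List[Tuple[str, float]]:
    """Rank groups by the summed TF-IDF of the interest keywords; return the top k."""
    docs = {gid: Counter(tokenize(text)) for gid, text in groups.items()}
    n_docs = len(docs)
    # Document frequency of each keyword across all groups.
    df = {kw: sum(1 for counts in docs.values() if kw in counts) for kw in keywords}
    scores: Dict[str, float] = {}
    for gid, counts in docs.items():
        total = sum(counts.values())
        if total == 0:
            continue
        score = 0.0
        for kw in keywords:
            if df[kw] == 0:
                continue
            tf = counts[kw] / total
            idf = math.log(n_docs / df[kw]) + 1.0
            score += tf * idf
        if score > 0:
            scores[gid] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
```

The groups returned for each interest would then serve as training data for the ML classifier in step 4.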

Association of a group's interests with its users

  • The engagement of a person in an interest category is proportional to the activity of the person in groups of this category

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

      • $l$: the number of post likes
      • $c_s$: the number of short comments
      • $c_l$: the number of long comments
      • $r$: the number of reposts

  • The association score of a user and an interest depends on the engagement in a group and on the number of groups

$$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

      • $e_{vk}$, $e_{fb}$: engagement in the VK/FB interest
      • $g_{vk}$, $g_{fb}$: the number of groups the user has in VK/FB
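
The two formulas above are simple weighted sums, as the following sketch shows. The weight values in `WEIGHTS` and the defaults for `alpha` and `beta` are placeholders chosen for illustration; the slides do not report the values actually used.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Activity:
    """Activity counts of one user in the groups of one interest category."""
    likes: int
    short_comments: int
    long_comments: int
    reposts: int


# Hypothetical weights: heavier actions count more.
WEIGHTS: Dict[str, float] = {"like": 1.0, "scomm": 2.0, "lcomm": 3.0, "repost": 4.0}


def engagement(a: Activity, w: Dict[str, float] = WEIGHTS) -> float:
    """e = w_like * l + w_scomm * c_s + w_lcomm * c_l + w_repost * r"""
    return (w["like"] * a.likes
            + w["scomm"] * a.short_comments
            + w["lcomm"] * a.long_comments
            + w["repost"] * a.reposts)


def association(e_fb: float, g_fb: int, e_vk: float, g_vk: int,
                alpha: float = 0.5, beta: float = 0.5) -> float:
    """all = alpha * e_fb * g_fb + beta * e_vk * g_vk"""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk
```

With these placeholder weights, a user with 3 likes, 2 short comments, 1 long comment and 1 repost gets an engagement of 14.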

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Data SNA User Gender User Language User Interests User Matching Other tasks References

Social networks from the data minerrsquos standpoint

Facebook (FB) and VKontakte (VK)

Profiles a set of user attributescategorical variables (region city profession etc)integer variables (age graduation year etc)text variables (name surname etc)

Network a graph that relates usersfriendship graphfollowers graphcommenting graph etc

Textspostscommentsgroup titles and descriptions

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Gathering of VK and FB data

Big Data VK worth tens or even hundreds of TBDecide what do you need (posts profiles etc)Download

APIScraping

Download limits and API limitations are specific for eachnetworkParallelization is very practical especially horizontal one

Amazon EC2 Distributed Message Queues

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Storing VK and FB data

Again Big DataNoSQL solutions are helpfulRaw data Amazon S3For analysis HDFSEfficient retrieval Elastic Search

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Social Network Analysis

Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches

What scientific communities analyze social networks

60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Technologies for analysis of social networks

Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically

Applying the model should be scaled horizontally

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Andrey Teterin

Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc

By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]

By full name Burger et al [2011] Panchenko and Teterin[2014]


Online demo

http://research.digsolab.com/gender


Training Data

- 100,000 full names of Facebook users with known gender
- full name: first and last name of a user
- gender: male or female
- names written in both Cyrillic and Latin alphabets
- "Alexander Ivanov", "Masha Sidorova", "Pavel Nikolenko", etc.


Training Data

Figure: Name-surname co-occurrences. Rows and columns are sorted by frequency.


Character endings of Russian names

- 72% of first names have a typical male/female ending
- 68% of surnames have a typical male/female ending
- a typical male/female ending splits males from females with an error of less than 5%
- gender of ≥ 50% of first names is recognized with 8 endings
- gender of ≥ 50% of second names is recognized with 5 endings

Conclusion

A simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach.


Character endings of Russian names

Type          Ending     Gender   Error, %   Example
first name    na (на)    female   0.27       Ekaterina
first name    iya (ия)   female   0.32       Anastasiya
first name    ei (ей)    male     0.16       Sergei
first name    dr (др)    male     0.00       Alexandr
first name    ga (га)    male     4.94       Serega
first name    an (ан)    male     4.99       Ivan
first name    la (ла)    female   4.23       Luidmila
first name    ii (ий)    male     0.34       Yurii
second name   va (ва)    female   0.28       Morozova
second name   ov (ов)    male     0.21       Objedkov
second name   na (на)    female   2.22       Matyushina
second name   ev (ев)    male     0.44       Sergeev
second name   in (ин)    male     1.94       Teterin

Table: Most discriminative and frequent two-character endings of Russian names.


Gender Detection Method

- input: a string representing a name of a person
- output: gender (male or female)
- binary classification task

Features

- endings
- character n-grams
- dictionary of male/female names and surnames

Model

L2-regularized Logistic Regression
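To make the model concrete, here is a minimal sketch of L2-regularized logistic regression trained on ending features only. It is not the authors' implementation: the four endings, the toy training names, the learning rate and the regularization strength are all illustrative assumptions, and plain batch gradient descent stands in for whatever solver was actually used.

```python
import math

# Illustrative female endings; an assumption, not the list used in the experiments.
FEMALE_ENDINGS = ("na", "va", "ya", "la")

def features(full_name):
    """Bias term plus binary ending indicators for the first and the last name."""
    first, last = full_name.lower().split()
    feats = [1.0]
    for part in (first, last):
        feats += [1.0 if part.endswith(e) else 0.0 for e in FEMALE_ENDINGS]
    return feats

def train(names, labels, l2=0.01, lr=0.5, epochs=300):
    """Batch gradient descent on the L2-regularized logistic loss (label 1 = female)."""
    X = [features(n) for n in names]
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [l2 * wi for wi in w]  # gradient of the L2 penalty
        for x, y in zip(X, labels):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for k, xk in enumerate(x):
                grad[k] += (p - y) * xk / len(X)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def predict(w, full_name):
    z = sum(wi * xi for wi, xi in zip(w, features(full_name)))
    return "female" if z > 0 else "male"

names = ["Alexander Ivanov", "Masha Sidorova", "Pavel Nikolenko",
         "Ekaterina Morozova", "Sergei Teterin", "Anastasiya Matyushina"]
labels = [0, 1, 0, 1, 0, 1]
w = train(names, labels)
print(predict(w, "Oleg Arbuzov"), predict(w, "Alexandra Arbuzova"))
```

With six training names the result only shows the mechanics; the reported experiments use far more data and richer features.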


Features

Character n-grams

- males: Alexander Yaroskavski, Oleg Arbuzov
- females: Alexandra Yaroskavskaya, Nataliya Arbuzova
- BUT: "Sidorenko", "Moroz" or "Bondar"
- two most common one-character endings: "a" and "ya" ("я")

Dictionaries of first and last names

- probability that a name belongs to the male gender: P(c = male | firstname), P(c = male | lastname)
- 3,427 first names, 11,411 last names
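The two feature families above can be sketched as follows. The boundary markers `^` and `$` and the relative-frequency estimate of P(c = male | name) are assumptions made for illustration; the slides do not say how n-grams were padded or how the dictionary probabilities were smoothed.

```python
from collections import Counter, defaultdict

def char_ngrams(name, n=3):
    """Character n-grams of a name, padded so that prefixes and endings are visible."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def male_probability(labelled_names):
    """Estimate P(c = male | name) by relative frequency from (name, gender) pairs."""
    counts = defaultdict(Counter)
    for name, gender in labelled_names:
        counts[name.lower()][gender] += 1
    return {name: c["male"] / sum(c.values()) for name, c in counts.items()}

print(char_ngrams("Oleg"))
# Toy labelled sample (hypothetical data).
sample = [("Sasha", "male"), ("Sasha", "female"), ("Sasha", "male"), ("Oleg", "male")]
print(male_probability(sample))
```

Ambiguous short forms such as "Sasha" end up with a probability strictly between 0 and 1, which is why the dictionary feature is a probability rather than a hard label.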


Results

Model                   Accuracy        Precision       Recall          F-measure
rule-based baseline     0.638           0.995           0.633           0.774
endings                 0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                 0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                   0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams         0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts           0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts   0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here endings: 4 Russian female endings; trigrams: 1,000 most frequent 3-grams; dictionary: name/surname dictionary. This table presents precision, recall and F-measure of the female class.
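As a quick consistency check, the F-measure column agrees with the harmonic mean of precision and recall (F1). The snippet below recomputes it for two rows of the table; the function name is ours.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

baseline = round(f_measure(0.995, 0.633), 3)  # rule-based baseline row
dicts = round(f_measure(0.992, 0.925), 3)     # dicts row
print(baseline, dicts)  # 0.774 0.957
```

The baseline illustrates the trade-off: near-perfect precision, but recall of only 0.633 pulls the F-measure down to 0.774.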


Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.

4 User Language Detection

Problem

Motivation

- Goal: to detect Russian-speaking users
- The Cyrillic alphabet is also used by Ukrainian, Belarusian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

- Which method is the best for the Russian language?
- How to adapt it to the FB profile?

Contributions

- Comparison of Russian-enabled language detection modules
- A technique for identification of Russian-speaking users


Method

- input: a FB user profile
- output: is Russian-speaker (or a set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка ать ото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те ели ере кот льн ник нти о с
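The idea behind trigram-based identification can be sketched in a few lines: build a profile of frequent character trigrams for a language and measure how much of a new text it covers. This is a deliberately simplified illustration with a tiny made-up training string; real modules rely on trained classifiers over much larger feature sets, and the function names here are ours.

```python
from collections import Counter

def _trigrams(text):
    """Character trigrams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return [text[i:i + 3] for i in range(len(text) - 2)]

def trigram_profile(text, top=300):
    """Set of the most frequent character trigrams of a text."""
    return {g for g, _ in Counter(_trigrams(text)).most_common(top)}

def profile_overlap(text, profile):
    """Fraction of the text's trigrams that occur in a language profile."""
    grams = _trigrams(text)
    return sum(g in profile for g in grams) / len(grams) if grams else 0.0

# Toy Russian "corpus" (illustrative only).
ru = trigram_profile("на то не ли по но в на про стороны которые и в то что он не был")
ru_score = profile_overlap("он не был на стороне", ru)
en_score = profile_overlap("he was not on that side", ru)
print(ru_score, en_score)
```

Note that spaces are part of the trigrams, which is how word-initial and word-final patterns enter the profile.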


Existing modules for language identification

- langid.py: https://github.com/saffsd/langid.py (advanced n-gram selection)
- chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector
- guess-language: https://code.google.com/p/guess-language
- Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language ($20 per 1M characters)
- Yandex Translate API: http://api.yandex.ru/translate (free of charge, 1M characters per day, as of September 2013)
- Many more, e.g. language-detection for Java

DBpedia Dataset


Accuracy of Different Language Detection Modules


Facebook Dataset: Method

- Profile text: posts + comments + user names − Latin symbols
- Profile text length: 3367 ± 17540
- Russian-speakers: P(ru) > 0.95
- Core Russian-speakers:
    - P(ru) > 0.95
    - Cyrillic symbols ≥ 20
    - locale is ru_RU
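The thresholds above translate into a simple decision rule. The sketch below assumes that the three "core" conditions must hold together and leaves the unit of the Cyrillic threshold open, since the slide does not state it; the function and parameter names are hypothetical.

```python
def classify_profile(p_ru, cyrillic_amount, locale,
                     p_threshold=0.95, cyrillic_threshold=20):
    """Return 'core', 'russian-speaker' or 'other' for one FB profile.

    p_ru            -- P(ru) returned by a language detector for the profile text
    cyrillic_amount -- amount of Cyrillic symbols (unit as used on the slide)
    locale          -- the profile locale, e.g. 'ru_RU'
    """
    if p_ru <= p_threshold:
        return "other"
    # Assumption: all three 'core' conditions are combined with a logical AND.
    if cyrillic_amount >= cyrillic_threshold and locale == "ru_RU":
        return "core"
    return "russian-speaker"

print(classify_profile(0.99, 35, "ru_RU"))
print(classify_profile(0.99, 35, "en_US"))
print(classify_profile(0.50, 100, "ru_RU"))
```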


Facebook Dataset: Results

- 9,906,524 public FB profiles (≥ 50 Cyrillic characters)
- 8,687,915 (88%) Russian-speaking users
- 3,190,813 (32%) core Russian-speaking public Facebook users
- 5,365,691 (54%) profiles with no profile text (≤ 200 characters)

5 User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

- input: some SN data representing a user
- output: list of user interests

Motivation

- Advertisement targeting, user segmentation, etc.
- Recommendations of content and friends
- Customization of user experience


Data: FB and VK groups

Text corpus

- 41 million VK groups
- 11 million FB publics
- 15 million FB groups

Data format

- Title and/or description
- List of members
- Number of comments, likes, posts by a member

Data: VK groups

253 interests detected by our system


Method

1. Create a text index of groups
2. Create a keyword list for each of the 253 interests
3. KW classifier
    - Retrieve the top k groups matching a set of interest keywords
    - Rank by TF-IDF
    - Associate the group's interests with its users
    - A group may have multiple interests
4. ML classifier
    - Use the top k groups as training data
    - BOW features
    - Keyword features
    - Linear models: L2 LR, Linear SVM, NB
    - Classify all groups
    - A group may have up to three top interests
    - Associate the group's interests with its users
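The keyword (KW) step can be illustrated with a small in-memory ranking. This is only a sketch: a production system would query a text index rather than scan all groups, and the TF-IDF variant used here (term frequency normalized by length, natural-log IDF) is one common choice, not necessarily the one behind the slides. Group ids and texts are made up.

```python
import math
from collections import Counter

def rank_groups(groups, keywords, k=2):
    """Rank group texts by the summed TF-IDF weight of the interest keywords."""
    docs = {gid: Counter(text.lower().split()) for gid, text in groups.items()}
    n = len(docs)
    scores = {}
    for gid, tf in docs.items():
        score = 0.0
        for kw in keywords:
            df = sum(1 for d in docs.values() if kw in d)  # document frequency
            if df and tf[kw]:
                score += tf[kw] / sum(tf.values()) * math.log(n / df)
        if score > 0:
            scores[gid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]

groups = {
    "g1": "football fans club football news",
    "g2": "cooking recipes and food",
    "g3": "news about football and hockey",
}
top = rank_groups(groups, ["football", "hockey"])
print(top)  # ['g3', 'g1']
```

The groups returned for each interest can then serve as (noisy) training data for the ML classifier in step 4.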


Association of a group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

- $l$: the number of post likes
- $c_s$: the number of short comments
- $c_l$: the number of long comments
- $r$: the number of reposts

The association score of a user and an interest depends on the engagement in a group and on the number of groups:

$$\mathrm{all} \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

- $e_{vk}$, $e_{fb}$: engagement in a VK/FB interest
- $g_{vk}$, $g_{fb}$: number of groups a user has in VK/FB
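The two formulas are straightforward to compute. In the sketch below all weights, as well as α and β, are placeholder values chosen for readability; the slides do not give the values actually used.

```python
def engagement(likes, short_comments, long_comments, reposts,
               w_like=1.0, w_scomm=2.0, w_lcomm=3.0, w_repost=4.0):
    """e = w_like*l + w_scomm*c_s + w_lcomm*c_l + w_repost*r (placeholder weights)."""
    return (w_like * likes + w_scomm * short_comments
            + w_lcomm * long_comments + w_repost * reposts)

def association(e_fb, g_fb, e_vk, g_vk, alpha=0.5, beta=0.5):
    """Association score: alpha*e_fb*g_fb + beta*e_vk*g_vk (placeholder alpha, beta)."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk

e_vk = engagement(likes=10, short_comments=2, long_comments=1, reposts=0)
e_fb = engagement(likes=3, short_comments=0, long_comments=0, reposts=1)
score = association(e_fb, 2, e_vk, 4)
print(e_vk, e_fb, score)  # 17.0 7.0 41.0
```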


Results

Results per category: the best and the worst

Top 30 interests on FB and VK


Intersection of the top 30 interests on FB and VK


Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

- input: a user profile in one social network
- output: the profile of the same person in another social network
- immediate applications in marketing, search, security, etc.

Contribution

- user identity resolution approach
- precision of 0.98 and recall of 0.54
- the method is computationally efficient and easily parallelizable


Dataset

                                   VK            Facebook
Number of users in our dataset     89,561,085    2,903,144
Number of users in Russia ¹        100,000,000   13,000,000
User overlap                       29%           88%

Training set: 92,488 matched FB-VK profiles

¹ According to comScore and http://vk.com/about

Profile matching algorithm

1. Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.

2. Candidate ranking. The candidates are ranked according to the similarity of their friends.

3. Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.


Candidate generation

- Retrieve FB users with names similar to the input VK profile
- Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
- Levenshtein automata for fuzzy matching between a VK user name and all FB user names
- Automatically extracted dictionary of name synonyms:
    - "Alexander", "Sasha", "Sanya", "Sanek", etc.
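The similarity rule for candidate generation can be sketched as below. For brevity the sketch computes the edit distance with the textbook dynamic-programming recurrence and scans the FB names linearly; it does not implement Levenshtein automata, which are what make the lookup fast at scale. The synonym dictionary and the names are made-up examples.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similar_names(a, b, max_dist=2):
    """Same first letter and edit distance of at most max_dist."""
    a, b = a.lower(), b.lower()
    return bool(a) and bool(b) and a[0] == b[0] and levenshtein(a, b) <= max_dist

def candidates(vk_name, fb_names, synonyms=None):
    """FB names similar to the VK name or to one of its dictionary synonyms."""
    variants = {vk_name.lower()} | set((synonyms or {}).get(vk_name.lower(), ()))
    return [n for n in fb_names if any(similar_names(v, n) for v in variants)]

syn = {"alexander": {"sasha", "sanya"}}  # hypothetical synonym dictionary
found = candidates("Alexander", ["Alexandr", "Sasha", "Alexey", "Oleg"], syn)
print(found)  # ['Alexandr', 'Sasha']
```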


Candidate ranking

- The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles
- Two friends are considered to be similar if:
    - the first two letters of their last names match
    - the similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$, where

$$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

- The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected name frequency:

$$sim_p(p_{vk}, p_{fb}) = \sum_{j:\ sim_s(s^f_i, s^f_j) > \alpha \,\wedge\, sim_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$$

Here $s^f_i$ and $s^s_i$ are the first and second names of a VK profile, respectively, while $s^f_j$ and $s^s_j$ refer to a FB profile.
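A possible reading of the ranking score in code is given below. Several points are assumptions on our side: each FB friend that matches at least one VK friend contributes once; |s| is read as the number of profiles carrying that name and N as the total number of profiles; the frequency tables are toy values. The default thresholds 0.8 and 0.6 are the ones reported for the final configuration.

```python
def lev(a, b):
    """Levenshtein edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    return 1.0 - lev(a, b) / max(len(a), len(b))

def sim_p(vk_friends, fb_friends, first_freq, last_freq, n_profiles,
          alpha=0.8, beta=0.6):
    """Sum of rarity-weighted contributions of FB friends matching a VK friend."""
    total = 0.0
    for f_fb, l_fb in fb_friends:
        matched = any(l_vk[:2] == l_fb[:2]
                      and sim_s(f_vk, f_fb) > alpha
                      and sim_s(l_vk, l_fb) > beta
                      for f_vk, l_vk in vk_friends)
        if matched:
            # Rare full names contribute up to 1, frequent ones much less.
            total += min(1.0, n_profiles / (first_freq[f_fb] * last_freq[l_fb]))
    return total

vk = [("ivan", "petrov"), ("olga", "sidorova")]
fb = [("ivan", "petrov"), ("anna", "kuznetsova")]
first_freq = {"ivan": 1000, "anna": 500}          # toy name counts
last_freq = {"petrov": 200, "kuznetsova": 300}
score = sim_p(vk, fb, first_freq, last_freq, n_profiles=100000)
print(score)  # 0.5
```

Weighting by name rarity means that sharing a friend called "Ivan Petrov" is weaker evidence than sharing a friend with an uncommon name.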


Best candidate selection

FB candidates are ranked according to their similarity $sim_p$ to an input profile $p_{vk}$.

The best candidate $p_{fb}$ should pass two thresholds to match:

- its score should be higher than the score threshold $\gamma$:

$$sim_p(p_{vk}, p_{fb}) > \gamma$$

- it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$$
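The two-threshold selection can be written as a small function. The defaults γ = 3 and δ = 5 are the values reported for the final configuration; treating a runner-up with a zero score as passing the ratio test is our assumption to avoid a division by zero, and the profile ids are hypothetical.

```python
def select_best(scored_candidates, gamma=3.0, delta=5.0):
    """Return the id of the matched FB profile, or None if no candidate qualifies.

    scored_candidates -- list of (fb_profile_id, sim_p score) pairs
    """
    ranked = sorted(scored_candidates, key=lambda c: c[1], reverse=True)
    if not ranked or ranked[0][1] <= gamma:
        return None                      # score threshold not passed
    if len(ranked) == 1:
        return ranked[0][0]              # the only candidate
    runner_up = ranked[1][1]
    if runner_up == 0 or ranked[0][1] / runner_up > delta:
        return ranked[0][0]              # ratio threshold passed
    return None

print(select_best([("fb1", 12.0), ("fb2", 2.0)]))  # fb1
print(select_best([("fb1", 12.0), ("fb2", 4.0)]))  # None
```

Rejecting ambiguous cases is what trades recall for precision in this step.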


Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.


Results: matching VK and FB profiles

First name threshold α         0.8
Second name threshold β        0.6
Profile score threshold γ      3
Profile ratio threshold δ      5
Number of matched profiles     644,334 (22%)
Expected precision             0.98
Expected recall                0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection

- "Tell me who your friends are, and I will tell you who you are"
- Most frequent age/region of friends
- Reject users with high variation of age/region among friends
- Up to 85-90% precision

User Income Detection

- Transfer learning: the target variable is not present in SNs
- Training a model on a set of users with known income
- Applying the model to the social network profiles
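The friends-based age rule from the first list above can be sketched as follows. Using the population standard deviation as the "variation" measure and 5 years as the rejection threshold are assumptions made for illustration only.

```python
from collections import Counter
from statistics import pstdev

def infer_age(friend_ages, max_std=5.0):
    """Most frequent age among friends, or None if the ages vary too much."""
    if not friend_ages or pstdev(friend_ages) > max_std:
        return None  # reject: no friends or too heterogeneous
    return Counter(friend_ages).most_common(1)[0][0]

print(infer_age([25, 25, 26, 24, 25]))  # 25
print(infer_age([15, 40, 62, 25]))      # None
```

The same pattern applies to regions, with a categorical dispersion measure in place of the standard deviation.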


Thank you! Questions?


References

Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In ICWSM.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., pages 18–21.

Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers' age and gender. In Third International AAAI Conference on Weblogs and Social Media.

Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Workshop.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics.

Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.

Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.


Gathering of VK and FB data

- Big Data: VK is worth tens or even hundreds of TB
- Decide what you need (posts, profiles, etc.)
- Download:
    - API
    - Scraping
- Download limits and API limitations are specific to each network
- Parallelization is very practical, especially horizontal:
    - Amazon EC2, distributed message queues


Storing VK and FB data

- Again: Big Data
- NoSQL solutions are helpful
- Raw data: Amazon S3
- For analysis: HDFS
- Efficient retrieval: Elasticsearch

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Social Network Analysis

Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches

What scientific communities analyze social networks

60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Technologies for analysis of social networks

Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically

Applying the model should be scaled horizontally

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Andrey Teterin

Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc

By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]

By full name Burger et al [2011] Panchenko and Teterin[2014]

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Online demo

httpresearchdigsolabcomgender

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Training Data

100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Training Data

Figure Name-surname co-occurrences rows and columns are sorted byfrequency

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Character endings of Russian names

72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings

Conclusion

Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Character endings of Russian names

Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin

Table Most discriminative and frequent two character endings ofRussian names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Gender Detection Method

input a string representing a name of a personoutput gender (male or female)binary classification task

Features

endingscharacter n-gramsdictionary of malefemale names and surnames

Model

L2-regularized Logistic Regression

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Features

Character n-grams

males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)

Dictionaries of first and last names

probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002

Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Motivation

Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc

Research Questions

Which method is the best for Russian languageHow to adopt it to the FB profile

Contributions

Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Existing modules for language identification

langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection

chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector

guess-languagehttpscodegooglecompguess-language

Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters

Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)

Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

DBpedia Dataset

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Accuracy of Different Language Detection Modules

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Method

Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers

P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Results

9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

input some SN data representing a useroutput list of user interests

Motivation

Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data FB and VK groups

Text corpus

41 million of VK groups11 million of FB publics15 million of FB groups

Data format

Title andor descriptionList of membersNumber of comments likes posts by a member

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data VK groups

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

253 interests detected by our system

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier

Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests

4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Association of a group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

$l$ – the number of post likes
$c_s$ – the number of short comments
$c_l$ – the number of long comments
$r$ – the number of reposts

The association score of a user and an interest depends on the engagement in a group and on the number of groups:

$$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

$e_{vk}$, $e_{fb}$ – engagement in the interest on VK/FB
$g_{vk}$, $g_{fb}$ – number of groups a user has on VK/FB
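As a worked illustration of the two scores, here is a small sketch. The weight values, the α and β defaults, and the `Activity` container are hypothetical placeholders; the slides do not give the actual values.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """Activity counters of one user in groups of one interest category."""
    likes: int = 0
    short_comments: int = 0
    long_comments: int = 0
    reposts: int = 0


# Hypothetical weights: the real values are not given in the slides.
WEIGHTS = {"like": 1.0, "scomm": 2.0, "lcomm": 3.0, "repost": 4.0}


def engagement(a, w=WEIGHTS):
    """e = w_like*l + w_scomm*c_s + w_lcomm*c_l + w_repost*r."""
    return (w["like"] * a.likes
            + w["scomm"] * a.short_comments
            + w["lcomm"] * a.long_comments
            + w["repost"] * a.reposts)


def association(e_fb, g_fb, e_vk, g_vk, alpha=0.5, beta=0.5):
    """all = alpha*e_fb*g_fb + beta*e_vk*g_vk (alpha, beta assumed here)."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk


e_fb = engagement(Activity(likes=3, short_comments=1))
e_vk = engagement(Activity(likes=10, long_comments=2, reposts=1))
score = association(e_fb, g_fb=2, e_vk=e_vk, g_vk=3)
print(e_fb, e_vk, score)
```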

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6. VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

Input: a user profile in one social network
Output: the profile of the same person in another social network
Immediate applications in marketing, search, security, etc.

Contribution

A user identity resolution approach
Precision of 0.98 and recall of 0.54
The method is computationally efficient and easily parallelizable


Dataset

                                  VK            Facebook
Number of users in our dataset    89,561,085    2,903,144
Number of users in Russia (1)     100,000,000   13,000,000
User overlap                      29            88

Training set: 92,488 matched FB-VK profiles

(1) According to comScore and http://vk.com/about

Profile matching algorithm

1. Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.

2. Candidate ranking. The candidates are ranked according to the similarity of their friends.

3. Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.


Candidate generation

Retrieve FB users with names similar to the input VK profile
Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
Levenshtein automata for fuzzy matching between a VK user name and all FB user names
Automatically extracted dictionary of name synonyms:
  "Alexander", "Sasha", "Sanya", "Sanek", etc.
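The name similarity rule can be sketched as follows. This version uses a plain dynamic-programming edit distance and a linear scan rather than the Levenshtein automata mentioned on the slide, and the synonym dictionary is a tiny hand-written stand-in for the automatically extracted one.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Hypothetical synonym dictionary mapping short forms to a canonical name.
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(name):
    name = name.lower()
    return NAME_SYNONYMS.get(name, name)


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance <= max_dist, after synonym lookup."""
    a, b = canonical(a), canonical(b)
    return bool(a) and bool(b) and a[0] == b[0] and levenshtein(a, b) <= max_dist


def candidates(vk_name, fb_names):
    """Linear scan over FB names; an automaton would avoid the full scan."""
    return [n for n in fb_names if names_similar(vk_name, n)]


found = candidates("Alexander", ["Aleksander", "Sasha", "Alexey", "Oleg"])
print(found)
```

A linear scan is only practical for small name lists; for millions of profiles an index-backed fuzzy match is needed, which is presumably why the slides mention automata.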


Candidate ranking

The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles.

Two friends are considered to be similar if:

The first two letters of their last names match
The similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$:

$$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected name frequency:

$$sim_p(p_{vk}, p_{fb}) = \sum_{j:\ sim_s(s^f_i, s^f_j) > \alpha \,\wedge\, sim_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$$

Here $s^f_i$ and $s^s_i$ are the first and second names of a VK profile, respectively, while $s^f_j$ and $s^s_j$ refer to a FB profile.
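The ranking score can be sketched under stated assumptions. In the profile similarity, $|s^f_j|$ and $|s^s_j|$ are read here as frequency counts of the first and last name and $N$ as a normalization constant; this reading follows the "inverse of name frequency" wording, but it is an interpretation, and the frequency tables below are invented toy values.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a, b):
    """sim_s = 1 - lev(a, b) / max(|a|, |b|), with |.| the string length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def friends_similar(f_vk, f_fb, alpha=0.8, beta=0.6):
    """Friends are (first, last) pairs; thresholds as reported in the slides."""
    (first_vk, last_vk), (first_fb, last_fb) = f_vk, f_fb
    return (last_vk[:2] == last_fb[:2]
            and sim_s(first_vk, first_fb) > alpha
            and sim_s(last_vk, last_fb) > beta)


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n):
    """Sum min(1, N / (freq(first) * freq(last))) over matched FB friends.

    Assumption: rare names contribute more than common ones.
    """
    total = 0.0
    for f_fb in fb_friends:
        if any(friends_similar(f_vk, f_fb) for f_vk in vk_friends):
            first, last = f_fb
            expected = first_freq.get(first, 1) * last_freq.get(last, 1)
            total += min(1.0, n / expected)
    return total


vk_friends = [("ivan", "petrov"), ("olga", "smirnova")]
fb_friends = [("ivan", "petrov"), ("olga", "smirnova"), ("john", "smith")]
first_freq = {"ivan": 100, "olga": 10}      # toy counts
last_freq = {"petrov": 50, "smirnova": 5}   # toy counts
score = sim_p(vk_friends, fb_friends, first_freq, last_freq, n=1000)
print(score)
```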


Best candidate selection

FB candidates are ranked according to their similarity $sim_p$ to an input profile $p_{vk}$.

The best candidate $p_{fb}$ should pass two thresholds to match:

Its score should be higher than the score threshold $\gamma$:

$$sim_p(p_{vk}, p_{fb}) > \gamma$$

It is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$$
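The two-threshold rule can be written as a short function. The defaults γ = 3 and δ = 5 are the values reported in the results of this talk; the function name, the input format, and the handling of a zero runner-up score are assumptions of this sketch.

```python
def select_best(scored, gamma=3.0, delta=5.0):
    """Pick the matching FB profile or return None.

    scored: list of (fb_profile_id, sim_p score) pairs for one VK profile.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return None
    best_id, best = ranked[0]
    # Threshold 1: the absolute score must exceed gamma.
    if best <= gamma:
        return None
    # Threshold 2: accept a single candidate, otherwise require a clear margin.
    if len(ranked) == 1:
        return best_id
    runner_up = ranked[1][1]
    # Assumption: a zero runner-up score counts as an unbounded ratio.
    if runner_up == 0 or best / runner_up > delta:
        return best_id
    return None


print(select_best([("a", 12.0), ("b", 2.0)]))
print(select_best([("a", 12.0), ("b", 4.0)]))
```

Raising either threshold trades recall for precision, which is the trade-off summarized by the precision-recall plot.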


Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α         0.8
Second name threshold β        0.6
Profile score threshold γ      3
Profile ratio threshold δ      5
Number of matched profiles     644,334 (22%)
Expected precision             0.98
Expected recall                0.54

7. Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection
  "Tell me who your friends are and I will tell you who you are"
  Most frequent age/region of friends
  Reject users with high variation of age/region among friends
  Up to 85-90% precision

User Income Detection
  Transfer learning: the target variable is not present in SNs
  Train a model on a set of users with known income
  Apply the model to the social network profiles
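The friend-based age detection idea can be sketched in a few lines. The rejection rule uses the population standard deviation with a hypothetical cut-off of 5 years; the slides do not say which variation measure or threshold was used.

```python
from collections import Counter
from statistics import pstdev


def predict_age(friend_ages, max_std=5.0):
    """Return the most frequent age among friends, or None if rejected.

    max_std is a hypothetical cut-off for "high variation" among friends.
    """
    if not friend_ages:
        return None
    if pstdev(friend_ages) > max_std:
        return None  # friends are too heterogeneous to trust the mode
    return Counter(friend_ages).most_common(1)[0][0]


print(predict_age([25, 25, 26, 24, 25]))
print(predict_age([18, 45, 70, 25]))
```

The same pattern would apply to regions by replacing the standard deviation with a measure suited to categorical values, such as the share of the most frequent region.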

Thank you! Questions?

Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In ICWSM.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., pages 18–21.

Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers' age and gender. In Third International AAAI Conference on Weblogs and Social Media.

Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Workshop.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics.

Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.

Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.


Storing VK and FB data

Again, Big Data
NoSQL solutions are helpful
Raw data: Amazon S3
For analysis: HDFS
Efficient retrieval: Elasticsearch

2. Social Network Analysis

Social Network Analysis

Structure analysis: friendship graph, comments graph, etc.
Content analysis: profile attributes, posts, comments, etc.
Combined approaches

Which scientific communities analyze social networks?

1960s – the first structural methods
2000s – online social network analysis boom
Social Network Analysis community (sociologists, statisticians, physicists)
Data and Graph Mining community
Natural Language Processing community


Technologies for analysis of social networks

Machine Learning: hidden vs. observable user attributes
Training of the model can often be scaled vertically
Applying the model should be scaled horizontally

3. User Gender Detection

Problem

Joint work with Andrey Teterin

Detect the gender of a user
  to profile a user
  user segmentation is helpful in search, advertisement, etc.

By text written by a user:
  Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009], Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al. [2010], Rangel and Rosso [2013], Al Zamal et al. [2012], and Liu et al. [2012]

By full name:
  Burger et al. [2011], Panchenko and Teterin [2014]

Online demo

http://research.digsolab.com/gender

Training Data

100,000 full names of Facebook users with known gender
Full name – first and last name of a user
Gender: male or female
Names written in both Cyrillic and Latin alphabets
"Alexander Ivanov", "Masha Sidorova", "Pavel Nikolenko", etc.

Training Data

Figure: Name-surname co-occurrences. Rows and columns are sorted by frequency.

Character endings of Russian names

72% of first names have a typical male/female ending
68% of surnames have a typical male/female ending
A typical male/female ending splits males from females with an error of less than 5%
Gender of ≥ 50% of first names is recognized with 8 endings
Gender of ≥ 50% of second names is recognized with 5 endings

Conclusion

A simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach.


Character endings of Russian names

Type          Ending     Gender   Error, %   Example
first name    na (на)    female   0.27       Ekaterina
first name    iya (ия)   female   0.32       Anastasiya
first name    ei (ей)    male     0.16       Sergei
first name    dr (др)    male     0.00       Alexandr
first name    ga (га)    male     4.94       Serega
first name    an (ан)    male     4.99       Ivan
first name    la (ла)    female   4.23       Luidmila
first name    ii (ий)    male     0.34       Yurii
second name   va (ва)    female   0.28       Morozova
second name   ov (ов)    male     0.21       Objedkov
second name   na (на)    female   2.22       Matyushina
second name   ev (ев)    male     0.44       Sergeev
second name   in (ин)    male     1.94       Teterin

Table: Most discriminative and frequent two-character endings of Russian names
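A rule-based baseline built on these endings might look like the sketch below. The ending-to-gender mappings are taken from the table (Latin transliterations only); the voting rule that combines first and last name is an assumption, not a description of the baseline actually evaluated in the talk.

```python
# Endings from the table above, Latin transliteration only.
FIRST_NAME_ENDINGS = {
    "na": "female", "iya": "female", "ei": "male", "dr": "male",
    "ga": "male", "an": "male", "la": "female", "ii": "male",
}
LAST_NAME_ENDINGS = {
    "va": "female", "ov": "male", "na": "female", "ev": "male", "in": "male",
}


def gender_by_ending(name, endings):
    """Return the gender of the longest matching ending, or None."""
    name = name.lower()
    for ending in sorted(endings, key=len, reverse=True):
        if name.endswith(ending):
            return endings[ending]
    return None


def classify(first, last):
    """Assumed voting rule: answer only if the available votes agree."""
    votes = {gender_by_ending(first, FIRST_NAME_ENDINGS),
             gender_by_ending(last, LAST_NAME_ENDINGS)} - {None}
    return votes.pop() if len(votes) == 1 else None


print(classify("Ekaterina", "Morozova"))
print(classify("Sergei", "Teterin"))
print(classify("Pavel", "Nikolenko"))
```

Names such as "Pavel Nikolenko" match no ending and stay unclassified, which illustrates why the rule-based approach leaves a sizeable share of names uncovered.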


Gender Detection Method

Input: a string representing the name of a person
Output: gender (male or female)
Binary classification task

Features

Endings
Character n-grams
Dictionary of male/female names and surnames

Model

L2-regularized Logistic Regression


Features

Character n-grams

Males: Alexander Yaroskavski, Oleg Arbuzov
Females: Alexandra Yaroskavskaya, Nayaliya Arbuzova
BUT: "Sidorenko", "Moroz", or "Bondar"
The two most common one-character endings: "a" and "ya" ("я")

Dictionaries of first and last names

Probability that a name belongs to the male gender: P(c = male | firstname), P(c = male | lastname)
3,427 first names, 11,411 last names
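Both feature families can be illustrated with a minimal extraction sketch. The boundary markers `^` and `$`, the feature naming scheme, and the 0.5 default for unseen names are assumptions made for this example; the resulting dictionaries could be fed to any linear model such as L2-regularized logistic regression.

```python
from collections import Counter


def char_ngrams(name, n=3):
    """Character n-grams with assumed boundary markers ^ and $."""
    padded = f"^{name.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def male_probability(labeled_names):
    """Estimate P(c = male | name) from (name, gender) training pairs."""
    total, male = Counter(), Counter()
    for name, gender in labeled_names:
        key = name.lower()
        total[key] += 1
        male[key] += gender == "male"
    return {key: male[key] / total[key] for key in total}


def features(first, last, p_first, p_last):
    """Sparse feature dict: 3-gram counts plus two dictionary probabilities."""
    feats = {}
    for gram in char_ngrams(first) + char_ngrams(last):
        feats["3g=" + gram] = feats.get("3g=" + gram, 0) + 1
    # Unseen names fall back to 0.5 (an assumption of this sketch).
    feats["p_male_first"] = p_first.get(first.lower(), 0.5)
    feats["p_male_last"] = p_last.get(last.lower(), 0.5)
    return feats


p_first = male_probability([("Sasha", "male"), ("Sasha", "female"), ("Oleg", "male")])
p_last = male_probability([("Arbuzov", "male"), ("Arbuzova", "female")])
x = features("Oleg", "Arbuzov", p_first, p_last)
print(char_ngrams("Oleg"))
```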


Results

Model                   Accuracy        Precision       Recall          F-measure
rule-based baseline     0.638           0.995           0.633           0.774
endings                 0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                 0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                   0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams         0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts           0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts   0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here, endings – 4 Russian female endings; trigrams – the 1,000 most frequent 3-grams; dictionary – name/surname dictionary. This table presents precision, recall, and F-measure of the female class.

Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.

4. User Language Detection

Problem

Motivation

Goal: to detect Russian-speaking users
The Cyrillic alphabet is also used by Ukrainian, Belarusian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

Which method is the best for the Russian language?
How to adapt it to the FB profile?

Contributions

Comparison of Russian-enabled language detection modules
A technique for identification of Russian-speaking users


Method

Input: a FB user profile
Output: whether the user is a Russian speaker (or the set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
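To make the trigram idea concrete, the sketch below scores a text by the share of its character trigrams that belong to a small set of common Russian trigrams. The set contains only the unambiguous three-letter items from the list above; the scoring function is a deliberately naive illustration and not one of the detectors compared in this talk.

```python
# Three-letter items taken from the list above (space-containing ones omitted).
COMMON_RU_TRIGRAMS = {
    "про", "ать", "ото", "ова", "тел", "тор", "сти", "стр", "ват", "иче",
    "сто", "вер", "ист", "ост", "тра", "ели", "ере", "кот", "льн", "ник",
}


def trigrams(text):
    """All overlapping character trigrams of the lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def russian_trigram_ratio(text):
    """Share of the text's trigrams found in the common Russian set."""
    grams = trigrams(text)
    if not grams:
        return 0.0
    return sum(g in COMMON_RU_TRIGRAMS for g in grams) / len(grams)


ru = russian_trigram_ratio("простой тест который проверяет")
en = russian_trigram_ratio("a simple test that checks things")
print(ru > en)
```

Such a ratio alone cannot separate Russian from other Cyrillic-script languages that share many trigrams, which is why trained detectors estimate per-language probabilities instead.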


Existing modules for language identification

langid.py
  https://github.com/saffsd/langid.py
  Advanced n-gram selection

Chromium Compact Language Detector (CLD)
  https://code.google.com/p/chromium-compact-language-detector

guess-language
  https://code.google.com/p/guess-language

Google Translate API
  https://developers.google.com/translate/v2/using_rest#detect-language
  $20 per 1M characters

Yandex Translate API
  http://api.yandex.ru/translate
  Free of charge for 1M characters per day (as of September 2013)

Many more, e.g. language-detection for Java

DBpedia Dataset


Accuracy of Different Language Detection Modules


Facebook Dataset: Method

Profile text: posts + comments + user names – Latin symbols
Profile text length: 3,367 ± 17,540
Russian speakers: P(ru) > 0.95
Core Russian speakers:
  P(ru) > 0.95
  Cyrillic symbols ≥ 20
  locale is ru_RU
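The two user categories can be expressed as simple predicates. In this sketch `p_ru` stands for the probability returned by whichever language detector is used, the Cyrillic threshold is read as a character count, and the `Profile` container is hypothetical; all three are assumptions.

```python
import re
from dataclasses import dataclass

# Basic Cyrillic Unicode block.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


@dataclass
class Profile:
    text: str      # profile text: posts, comments, user names
    locale: str    # FB locale setting, e.g. "ru_RU"
    p_ru: float    # P(ru) from an external language detector (assumed given)


def is_russian_speaker(profile, threshold=0.95):
    return profile.p_ru > threshold


def is_core_russian_speaker(profile, min_cyrillic=20):
    """P(ru) > 0.95, at least 20 Cyrillic symbols, and locale ru_RU."""
    n_cyrillic = len(CYRILLIC.findall(profile.text))
    return (is_russian_speaker(profile)
            and n_cyrillic >= min_cyrillic
            and profile.locale == "ru_RU")


text = "привет, это мой профиль на фейсбуке"
print(is_core_russian_speaker(Profile(text, "ru_RU", 0.99)))
print(is_core_russian_speaker(Profile(text, "en_US", 0.99)))
```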


Facebook Dataset: Results

9,906,524 public FB profiles (≥ 50 Cyrillic characters)
8,687,915 (88%) Russian-speaking users
3,190,813 (32%) core Russian-speaking public Facebook users
5,365,691 (54%) profiles with no profile text (≤ 200 characters)

5. User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Input: some SN data representing a user
Output: a list of user interests

Motivation

Advertisement targeting, user segmentation, etc.
Recommendations of content and friends
Customization of user experience


Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Social Network Analysis

Structure analysis friendship graph comments graph etcContent analysis profile attributes posts comments etcCombined approaches

What scientific communities analyze social networks

60s ndash the first structural methods00s ndash online social network analysis boomSocial Network Analysis community (SociologistsStatisticians Physicists)Data and Graph Mining communityNatural Language Processing community

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Technologies for analysis of social networks

Machine Learning hidden vs observable user attributesTraining of the model often can be scaled vertically

Applying the model should be scaled horizontally

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Problem

Joint work with Andrey Teterin

  • Detect gender of a user
      • to profile a user
      • user segmentation is helpful in search, advertisement, etc.
  • By text written by a user: Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009], Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al. [2010], Rangel and Rosso [2013], Al Zamal et al. [2012], and Liu et al. [2012]
  • By full name: Burger et al. [2011], Panchenko and Teterin [2014]

Online demo

http://research.digsolab.com/gender

Training Data

  • 100,000 full names of Facebook users with known gender
  • full name: first and last name of a user
  • gender: male or female
  • names written in both Cyrillic and Latin alphabets
  • "Alexander Ivanov", "Masha Sidorova", "Pavel Nikolenko", etc.

Figure: Name-surname co-occurrences; rows and columns are sorted by frequency.

Character endings of Russian names

  • 72% of first names have a typical male/female ending
  • 68% of surnames have a typical male/female ending
  • a typical male/female ending splits males from females with an error of less than 5%
  • the gender of ≥ 50% of first names is recognized with 8 endings
  • the gender of ≥ 50% of second names is recognized with 5 endings

Conclusion

A simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach.

Character endings of Russian names

Type          Ending     Gender   Error (%)   Example
first name    na (на)    female   0.27        Ekaterina
first name    iya (ия)   female   0.32        Anastasiya
first name    ei (ей)    male     0.16        Sergei
first name    dr (др)    male     0.00        Alexandr
first name    ga (га)    male     4.94        Serega
first name    an (ан)    male     4.99        Ivan
first name    la (ла)    female   4.23        Luidmila
first name    ii (ий)    male     0.34        Yurii
second name   va (ва)    female   0.28        Morozova
second name   ov (ов)    male     0.21        Objedkov
second name   na (на)    female   2.22        Matyushina
second name   ev (ев)    male     0.44        Sergeev
second name   in (ин)    male     1.94        Teterin

Table: Most discriminative and frequent two-character endings of Russian names.
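One way to read the error column is as the share of names with a given ending that do not belong to the ending's majority gender. The sketch below computes that quantity under this assumption; the `ending_stats` helper and the toy `SAMPLE` list are hypothetical and are not the data behind the table.

```python
from collections import Counter, defaultdict


def ending_stats(labeled_names, n=2):
    """Map each n-character ending to (majority gender, error rate)."""
    counts = defaultdict(Counter)
    for name, gender in labeled_names:
        counts[name.lower()[-n:]][gender] += 1
    stats = {}
    for ending, by_gender in counts.items():
        majority, majority_count = by_gender.most_common(1)[0]
        total = sum(by_gender.values())
        # Error = fraction of names with this ending outside the majority gender.
        stats[ending] = (majority, 1 - majority_count / total)
    return stats


# Hypothetical toy sample (assumption, for illustration only).
SAMPLE = [
    ("Ekaterina", "female"), ("Irina", "female"), ("Marina", "female"),
    ("Sergei", "male"), ("Andrei", "male"),
    ("Ivan", "male"), ("Stepan", "male"), ("Ruslan", "male"),
    ("Lilian", "female"),
]

STATS = ending_stats(SAMPLE)
for ending, (gender, error) in sorted(STATS.items()):
    print(f"{ending}: {gender}, error {error:.2%}")
```

On real data the same loop would be run separately for first names and surnames, as the table does.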

Gender Detection Method

  • input: a string representing a name of a person
  • output: gender (male or female)
  • binary classification task

Features

  • endings
  • character n-grams
  • dictionary of male/female names and surnames

Model

  • L2-regularized Logistic Regression

Features

Character n-grams

  • males: Alexander Yaroskavski, Oleg Arbuzov
  • females: Alexandra Yaroskavskaya, Nayaliya Arbuzova
  • BUT: "Sidorenko", "Moroz" or "Bondar"
  • two most common one-character endings: "a" and "ya" ("я")

Dictionaries of first and last names

  • probability that it belongs to the male gender: P(c = male | firstname), P(c = male | lastname)
  • 3,427 first names, 11,411 last names
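The three feature families can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the `^`/`$` boundary padding, the add-one smoothing and the 0.5 fallback for unseen names are all assumptions.

```python
from collections import Counter


def char_ngrams(token, n=3):
    """Character n-grams of a token padded with boundary markers (assumption)."""
    padded = f"^{token.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def male_probabilities(labeled_tokens, smoothing=1.0):
    """Estimate P(c = male | token) from (token, gender) pairs with smoothing."""
    male, total = Counter(), Counter()
    for token, gender in labeled_tokens:
        key = token.lower()
        total[key] += 1
        if gender == "male":
            male[key] += 1
    return {t: (male[t] + smoothing) / (total[t] + 2 * smoothing) for t in total}


def extract_features(full_name, first_dict, last_dict):
    """Build a sparse feature dict: endings, character 3-grams, dictionary scores."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    features = {}
    for token, prefix in ((first, "first"), (last, "last")):
        for ending_len in (1, 2):
            features[f"{prefix}_end={token.lower()[-ending_len:]}"] = 1.0
        for gram in char_ngrams(token):
            features[f"{prefix}_3g={gram}"] = 1.0
    # Unseen names fall back to 0.5, i.e. no evidence either way (assumption).
    features["p_male_first"] = first_dict.get(first.lower(), 0.5)
    features["p_male_last"] = last_dict.get(last.lower(), 0.5)
    return features


# Hypothetical toy dictionary data.
FIRST_DICT = male_probabilities(
    [("Oleg", "male"), ("Oleg", "male"), ("Alexandra", "female")]
)
FEATURES = extract_features("Alexandra Arbuzova", FIRST_DICT, {})
```

Feature dicts of this shape could then be vectorized and passed to any L2-regularized logistic regression implementation.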

Results

Model                   Accuracy        Precision       Recall          F-measure
rule-based baseline     0.638           0.995           0.633           0.774
endings                 0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                 0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                   0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams         0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts           0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts   0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here endings: 4 Russian female endings; trigrams: 1,000 most frequent 3-grams; dictionary: name/surname dictionary. The table presents precision, recall and F-measure of the female class.
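As a quick consistency check on the table, the F-measure column can be recomputed from precision and recall. The sketch assumes the balanced F1 (harmonic mean), which the slide does not state explicitly.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the table rows above.
BASELINE_F = round(f_measure(0.995, 0.633), 3)
ENDINGS_F = round(f_measure(0.921, 0.784), 3)
DICTS_F = round(f_measure(0.992, 0.925), 3)
print(BASELINE_F, ENDINGS_F, DICTS_F)
```

The recomputed values agree with the reported F-measures for these three rows.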

Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.

4 User Language Detection

Problem

Motivation

  • Goal: to detect Russian-speaking users
  • The Cyrillic alphabet is also used by Ukrainian, Belorussian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

  • Which method is the best for the Russian language?
  • How to adapt it to the FB profile?

Contributions

  • Comparison of Russian-enabled language detection modules
  • A technique for identification of Russian-speaking users

Method

  • input: a FB user profile
  • output: is Russian-speaker (or a set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с

Existing modules for language identification

  • langid.py
      • https://github.com/saffsd/langid.py
      • Advanced n-gram selection
  • chromium compact language detector (cld)
      • https://code.google.com/p/chromium-compact-language-detector/
  • guess-language
      • https://code.google.com/p/guess-language/
  • Google Translate API
      • https://developers.google.com/translate/v2/using_rest#detect-language
      • $20 per 1M characters
  • Yandex Translate API
      • http://api.yandex.ru/translate/
      • Free of charge, 1M characters per day (as of September 2013)
  • Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

  • Profile text: posts + comments + user names - Latin symbols
  • Profile text length: 3367 ± 17540
  • Russian-speakers: P(ru) > 0.95
  • Core Russian-speakers:
      • P(ru) > 0.95
      • Cyrillic symbols >= 20
      • locale is ru_RU
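The two user classes above amount to a simple decision rule on top of a language detector's output. The sketch below assumes that `p_ru` is supplied by an external language identification module, that "Cyrillic symbols >= 20" refers to a character count (the slide does not say whether it is a count or a percentage), and that all three core conditions must hold together; the function names are hypothetical.

```python
def count_cyrillic(text):
    """Count characters in the basic Cyrillic Unicode block (U+0400 to U+04FF)."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")


def classify_profile(p_ru, profile_text, locale,
                     p_threshold=0.95, min_cyrillic=20):
    """Return 'core Russian-speaker', 'Russian-speaker' or 'other'."""
    if p_ru <= p_threshold:
        return "other"
    # Assumption: the 20-symbol threshold is a count, not a percentage.
    if count_cyrillic(profile_text) >= min_cyrillic and locale == "ru_RU":
        return "core Russian-speaker"
    return "Russian-speaker"


# Hypothetical profile text; p_ru would come from a language detector.
TEXT = "Привет, как дела? Сегодня хорошая погода"
print(classify_profile(0.99, TEXT, "ru_RU"))
```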

Facebook Dataset: Results

  • 9,906,524 public FB profiles (>= 50 cyr. characters)
  • 8,687,915 (88%) Russian-speaking users
  • 3,190,813 (32%) core Russian-speaking public Facebook users
  • 5,365,691 (54%) of profiles with no profile text (<= 200 characters)

5 User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

  • input: some SN data representing a user
  • output: list of user interests

Motivation

  • Advertisement targeting, user segmentation, etc.
  • Recommendations of content and friends
  • Customization of user experience

Data: FB and VK groups

Text corpus

  • 41 million VK groups
  • 11 million FB publics
  • 15 million FB groups

Data format

  • Title and/or description
  • List of members
  • Number of comments, likes, posts by a member

Data VK groups

253 interests detected by our system

Method

1 Create a text index of groups
2 Create a keyword list for each of 253 interests
3 KW classifier
      • Retrieve top k groups matched by a set of interest keywords
      • Rank by TF-IDF
      • Associate group's interests with its users
      • A group may have multiple interests
4 ML classifier
      • Use top k groups as training data
      • BOW features
      • Keyword features
      • Linear models: L2 LR, Linear SVM, NB
      • Classify all groups
      • A group may have up to three top interests
      • Associate group's interests with its users
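The keyword (KW) classifier step can be sketched as a small TF-IDF retrieval loop. This is an illustration under stated assumptions, not the system described in the talk: the in-memory index, whitespace tokenization, the `tf * log(N / df)` weighting and the toy `GROUPS` and `INTEREST_KEYWORDS` data are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: group id -> title/description text.
GROUPS = {
    "g1": "football fans club football news",
    "g2": "cooking recipes and food",
    "g3": "football and cooking",
    "g4": "travel photos",
}
INTEREST_KEYWORDS = {
    "football": ["football", "soccer"],
    "cooking": ["cooking", "recipes"],
}


def build_index(groups):
    """Return per-group term frequencies and inverse document frequencies."""
    tfs = {gid: Counter(text.lower().split()) for gid, text in groups.items()}
    df = Counter()
    for tf in tfs.values():
        df.update(tf.keys())
    idf = {term: math.log(len(groups) / count) for term, count in df.items()}
    return tfs, idf


def top_k_groups(keywords, tfs, idf, k):
    """Rank groups by the summed TF-IDF weight of the interest keywords."""
    scores = {}
    for gid, tf in tfs.items():
        score = sum(tf[kw] * idf.get(kw, 0.0) for kw in keywords)
        if score > 0:
            scores[gid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


def label_groups(interest_keywords, groups, k=2):
    """Attach each interest to its top-k groups; a group may get several."""
    tfs, idf = build_index(groups)
    labels = {}
    for interest, keywords in interest_keywords.items():
        for gid in top_k_groups(keywords, tfs, idf, k):
            labels.setdefault(gid, []).append(interest)
    return labels


LABELS = label_groups(INTEREST_KEYWORDS, GROUPS)
print(LABELS)
```

The groups labeled this way could also serve as training data for the ML classifier in step 4.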

Association of group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$

  • $l$: the number of post likes
  • $c_s$: the number of short comments
  • $c_l$: the number of long comments
  • $r$: the number of reposts

The association score of a user and an interest depends on engagement in a group and on the number of groups:

$\mathrm{all} \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$

  • $e_{vk}$, $e_{fb}$: engagement in a VK/FB interest
  • $g_{vk}$, $g_{fb}$: number of groups a user has in VK/FB
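A small numeric sketch of the two scores may help. The weight values and the `alpha`/`beta` coefficients below are hypothetical; the slides do not give concrete values.

```python
# Hypothetical activity weights (assumption: not given in the talk).
WEIGHTS = {"like": 1.0, "scomm": 2.0, "lcomm": 3.0, "repost": 4.0}


def engagement(likes, short_comments, long_comments, reposts, weights=WEIGHTS):
    """Weighted sum of a user's activity in the groups of one interest."""
    return (weights["like"] * likes
            + weights["scomm"] * short_comments
            + weights["lcomm"] * long_comments
            + weights["repost"] * reposts)


def association(e_fb, g_fb, e_vk, g_vk, alpha=1.0, beta=1.0):
    """Combine FB and VK engagement, each scaled by the number of groups."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk


E_FB = engagement(likes=10, short_comments=2, long_comments=1, reposts=1)
SCORE = association(e_fb=E_FB, g_fb=2, e_vk=5.0, g_vk=3, alpha=0.5, beta=1.0)
print(E_FB, SCORE)
```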

Results

Results per category the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • input: a user profile of one social network
  • output: profile of the same person in another social network
  • immediate applications in marketing, search, security, etc.

Contribution

  • user identity resolution approach
  • precision of 0.98 and recall of 0.54
  • the method is computationally efficient and easily parallelizable

Dataset

                                   VK            Facebook
Number of users in our dataset     89,561,085    2,903,144
Number of users in Russia [1]      100,000,000   13,000,000
User overlap                       29            88

  • training set: 92,488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Profile matching algorithm

1 Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.

2 Candidate ranking. The candidates are ranked according to similarity of their friends.

3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if the first letters are the same and the edit distance between names ≤ 2
  • Levenshtein Automata for fuzzy match between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms
      • "Alexander", "Sasha", "Sanya", "Sanek", etc.
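The similarity rule can be sketched with a brute-force scan. Note the substitution: the talk uses Levenshtein automata for efficient fuzzy lookup over all FB names, whereas this illustration compares names pairwise with a plain dynamic-programming edit distance. The `SYNONYMS` mapping and the profile tuples are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def names_similar(a, b, max_distance=2):
    """Same first letter and edit distance of at most max_distance."""
    a, b = a.lower(), b.lower()
    if not a or not b or a[0] != b[0]:
        return False
    return levenshtein(a, b) <= max_distance


# Hypothetical synonym dictionary: diminutive -> canonical first name.
SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(first_name):
    return SYNONYMS.get(first_name.lower(), first_name.lower())


def generate_candidates(vk_profile, fb_profiles):
    """Return FB (first, last) pairs whose names are similar to the VK profile."""
    vk_first, vk_last = vk_profile
    return [
        (fb_first, fb_last)
        for fb_first, fb_last in fb_profiles
        if names_similar(canonical(vk_first), canonical(fb_first))
        and names_similar(vk_last, fb_last)
    ]


FB_PROFILES = [("Alexander", "Ivanov"), ("Aleksander", "Ivanof"),
               ("Maria", "Ivanova"), ("Alexander", "Petrov")]
CANDIDATES = generate_candidates(("Sasha", "Ivanov"), FB_PROFILES)
print(CANDIDATES)
```

A pairwise scan like this is quadratic in the number of profiles, which is why an automaton or index-based lookup is needed at the scale of the dataset.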

Candidate ranking

  • The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles.
  • Two friends are considered to be similar if:
      • the first two letters of their last names match;
      • the similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$:

$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$

  • The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the name expectation frequency:

$sim_p(p_{vk}, p_{fb}) = \sum_{j:\, sim_s(s_i^f, s_j^f) > \alpha \,\wedge\, sim_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)$

Here $s_i^f$ and $s_i^s$ are the first and second names of a VK profile, respectively, while $s_j^f$ and $s_j^s$ refer to a FB profile.
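The ranking score can be sketched as below. Several points are assumptions rather than statements from the slides: $|s_j^f| \cdot |s_j^s|$ is interpreted as the product of the corpus frequencies of the first and last name (looked up in a hypothetical `name_freq` table), $N$ is treated as a free normalization constant, and each VK friend is allowed to contribute at most once.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))


def sim_p(vk_friends, fb_friends, name_freq, n, alpha=0.8, beta=0.6):
    """Profile similarity: sum of capped inverse name frequencies of matched friends."""
    score = 0.0
    for vk_first, vk_last in vk_friends:
        for fb_first, fb_last in fb_friends:
            if vk_last[:2].lower() != fb_last[:2].lower():
                continue  # first two letters of the last names must match
            if sim_s(vk_first, fb_first) > alpha and sim_s(vk_last, fb_last) > beta:
                # Assumption: expected frequency = freq(first) * freq(last).
                expected = (name_freq.get(fb_first.lower(), 1)
                            * name_freq.get(fb_last.lower(), 1))
                score += min(1.0, n / expected)
                break  # assumption: one contribution per VK friend
    return score


# Hypothetical friend lists and name frequencies.
VK_FRIENDS = [("Ivan", "Petrov"), ("Olga", "Sidorova")]
FB_FRIENDS = [("Ivan", "Petrov"), ("Olga", "Sidorova"), ("Anna", "Smirnova")]
NAME_FREQ = {"ivan": 10, "petrov": 10, "olga": 2, "sidorova": 5}
PROFILE_SCORE = sim_p(VK_FRIENDS, FB_FRIENDS, NAME_FREQ, n=50)
print(PROFILE_SCORE)
```

In this toy run the common name pair contributes 0.5 and the rarer pair contributes the capped maximum of 1.0, so rare shared friends weigh more.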

Best candidate selection

  • FB candidates are ranked according to similarity $sim_p$ to an input profile $p_{vk}$.
  • The best candidate $p_{fb}$ should pass two thresholds to match:
      • its score should be higher than the score threshold $\gamma$:

$sim_p(p_{vk}, p_{fb}) > \gamma$

      • it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$
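The two-threshold test can be written as a short function. This is a sketch: the `select_best` name and the input format (a list of `(profile_id, score)` pairs) are hypothetical, the defaults for `gamma` and `delta` are the values reported later in the talk, and treating a zero runner-up score as passing the ratio test is an assumption.

```python
def select_best(scored_candidates, gamma=3.0, delta=5.0):
    """Return the id of the best candidate, or None if no confident match exists."""
    ranked = sorted(scored_candidates, key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best_id, best_score = ranked[0]
    if best_score <= gamma:
        return None  # fails the score threshold
    if len(ranked) == 1:
        return best_id  # the only candidate
    runner_up_score = ranked[1][1]
    # Assumption: a zero runner-up score counts as an infinite ratio.
    if runner_up_score == 0 or best_score / runner_up_score > delta:
        return best_id
    return None  # fails the ratio threshold


print(select_best([("fb1", 12.0), ("fb2", 2.0)]))
print(select_best([("fb1", 12.0), ("fb2", 4.0)]))
```

Returning no match when the runner-up is too close trades recall for precision, which is consistent with the high-precision, moderate-recall result reported for the method.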

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at given recall.

Results: matching VK and FB profiles

First name threshold α        0.8
Second name threshold β       0.6
Profile score threshold γ     3
Profile ratio threshold δ     5
Number of matched profiles    644,334 (22%)
Expected precision            0.98
Expected recall               0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

  • User Age & Region Detection
      • "Tell me who your friends are, and I will tell you who you are"
      • Most frequent age/region of friends
      • Reject users with high variation of age/region among friends
      • Up to 85-90% precision
  • User Income Detection
      • Transfer learning: the target variable is not present in SNs
      • Training a model on a set of users with known income
      • Applying the model to the social network profiles
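The friend-based age heuristic can be sketched in a few lines. The thresholds (`max_std`, `min_friends`) and the use of the population standard deviation as the "variation" measure are assumptions; the talk only states the general idea.

```python
from collections import Counter
from statistics import pstdev


def infer_age(friend_ages, max_std=5.0, min_friends=3):
    """Return the most frequent age among friends, or None if it is unreliable."""
    if len(friend_ages) < min_friends:
        return None  # too little evidence
    if pstdev(friend_ages) > max_std:
        return None  # reject users whose friends' ages vary too much
    return Counter(friend_ages).most_common(1)[0][0]


print(infer_age([25, 25, 26, 24, 25]))
print(infer_age([18, 45, 25, 60, 33]))
```

The same pattern applies to region detection by replacing the variance check with, for example, a minimum share for the most frequent region.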

Thank you! Questions?

References

Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In ICWSM.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., pages 18–21.

Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers' age and gender. In Third International AAAI Conference on Weblogs and Social Media.

Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Workshop.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics.

Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd international workshop on Search and mining user-generated contents, pages 37–44. ACM.

Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, pages 37–44. ACM.

Page 9: Text Analysis of Social Networks: Working with FB and VK Data


Candidate generation

Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms

ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate ranking

The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if

First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β

sims(si sj) = 1minus lev(si sj)max(|si | |sj |)

Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency

simp(pvk pfb) =sum

j sims(sfi s

fj )gtαandsims(ss

i ssj )gtβ

min(1N

|s fj | middot |ss

j |)

Here s fi and ss

i are first and second names of a VK profilecorrespondingly while s f

j and ssj refer to a FB profile

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Best candidate selection

FB candidates are ranked according to similarity simp to aninput profile pvk

The best candidate pfb should pass two thresholds to match

its score should be higher than the score threshold γ

simp(pvk pfb) gt γ

either the only candidate or score ratio between it and the nextbest candidate pprime

fb should be higher than the ratio threshold δ

simp(pvk pfb)

simp(pvk pprimefb)

gt δ

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results matching VK and FB profiles

First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Much more fun stuff can be done with the FBVK data

User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision

User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Thank you Questions

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM

Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics

Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21

Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media

Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412

Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop

Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics

Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM

Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM

Alexander Panchenko Text Analysis of Social Networks


Technologies for analysis of social networks

  • Machine learning: hidden vs. observable user attributes
  • Training of the model can often be scaled vertically
  • Applying the model should be scaled horizontally

3 User Gender Detection

Problem

Joint work with Andrey Teterin

  • Detect the gender of a user
      • to profile a user
      • user segmentation is helpful in search, advertisement, etc.
  • By text written by a user: Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009], Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al. [2010], Rangel and Rosso [2013], Al Zamal et al. [2012], and Liu et al. [2012]
  • By full name: Burger et al. [2011], Panchenko and Teterin [2014]

Online demo

http://research.digsolab.com/gender

Training Data

  • 100,000 full names of Facebook users with known gender
  • full name – first and last name of a user
  • gender: male or female
  • names written in both Cyrillic and Latin alphabets
  • "Alexander Ivanov", "Masha Sidorova", "Pavel Nikolenko", etc.

Figure: Name-surname co-occurrences. Rows and columns are sorted by frequency.

Character endings of Russian names

  • 72% of first names have a typical male/female ending
  • 68% of surnames have a typical male/female ending
  • a typical male/female ending splits males from females with an error of less than 5%
  • gender of ≥ 50% of first names is recognized with 8 endings
  • gender of ≥ 50% of second names is recognized with 5 endings

Conclusion

A simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach.

Character endings of Russian names

Type          Ending     Gender   Error, %   Example
first name    na (на)    female   0.27       Ekaterina
first name    iya (ия)   female   0.32       Anastasiya
first name    ei (ей)    male     0.16       Sergei
first name    dr (др)    male     0.00       Alexandr
first name    ga (га)    male     4.94       Serega
first name    an (ан)    male     4.99       Ivan
first name    la (ла)    female   4.23       Luidmila
first name    ii (ий)    male     0.34       Yurii
second name   va (ва)    female   0.28       Morozova
second name   ov (ов)    male     0.21       Objedkov
second name   na (на)    female   2.22       Matyushina
second name   ev (ев)    male     0.44       Sergeev
second name   in (ин)    male     1.94       Teterin

Table: Most discriminative and frequent two-character endings of Russian names
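The ending-based rule from the table can be sketched in a few lines of Python. This is only an illustration, not the authors' baseline: the function names are hypothetical, the endings are the Latin transliterations listed in the table, and returning None on disagreement or on an unknown ending is an assumption of this sketch.

```python
# Endings copied from the table above (Latin transliteration only).
FIRST_NAME_ENDINGS = {
    "na": "female", "iya": "female", "ei": "male", "dr": "male",
    "ga": "male", "an": "male", "la": "female", "ii": "male",
}
LAST_NAME_ENDINGS = {
    "va": "female", "ov": "male", "na": "female", "ev": "male", "in": "male",
}


def by_ending(name, endings):
    """Return the gender suggested by the longest matching ending, or None."""
    name = name.lower()
    for ending in sorted(endings, key=len, reverse=True):
        if name.endswith(ending):
            return endings[ending]
    return None


def guess_gender(first_name, last_name):
    """Combine both names; None when they disagree or no ending matches."""
    votes = {by_ending(first_name, FIRST_NAME_ENDINGS),
             by_ending(last_name, LAST_NAME_ENDINGS)} - {None}
    return votes.pop() if len(votes) == 1 else None
```

Names such as "Pavel Nikolenko" match no ending and stay unclassified, which is the gap the statistical model is meant to close.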

Gender Detection Method

  • input: a string representing a name of a person
  • output: gender (male or female)
  • binary classification task

Features

  • endings
  • character n-grams
  • dictionary of male/female names and surnames

Model

  • L2-regularized Logistic Regression

Features

Character n-grams

  • males: Alexander Yaroskavski, Oleg Arbuzov
  • females: Alexandra Yaroskavskaya, Nataliya Arbuzova
  • BUT: "Sidorenko", "Moroz", or "Bondar"
  • the two most common one-character endings: "a" and "ya" ("я")

Dictionaries of first and last names

  • probability that a name belongs to the male gender: P(c = male | firstname), P(c = male | lastname)
  • 3,427 first names, 11,411 last names
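The two feature families above can be illustrated with a minimal Python sketch. The helper names and the "^"/"$" boundary markers are assumptions made for this example; the slides do not say how the n-grams were extracted or how the dictionary probabilities were smoothed.

```python
from collections import Counter, defaultdict


def char_ngrams(name, n=3):
    """Character n-grams of a name, with hypothetical boundary markers."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def male_probabilities(labeled_names):
    """Estimate P(c = male | name) as a relative frequency per name."""
    counts = defaultdict(Counter)
    for name, gender in labeled_names:
        counts[name.lower()][gender] += 1
    return {name: c["male"] / sum(c.values()) for name, c in counts.items()}
```

The end-of-name trigram (for example "va$" in "Arbuzova") carries the same signal as the ending rules, while the dictionary covers names such as "Sasha" that are ambiguous by spelling alone.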

Results

Model                   Accuracy        Precision       Recall          F-measure
rule-based baseline     0.638           0.995           0.633           0.774
endings                 0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                 0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                   0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams         0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts           0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts   0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here endings – 4 Russian female endings; trigrams – 1,000 most frequent 3-grams; dictionary – name/surname dictionary. The table presents precision, recall, and F-measure of the female class.

Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.

4 User Language Detection

Problem

Motivation

  • Goal: to detect Russian-speaking users
  • The Cyrillic alphabet is also used by Ukrainian, Belarusian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

  • Which method is the best for the Russian language?
  • How to adapt it to the FB profile?

Contributions

  • Comparison of Russian-enabled language detection modules
  • A technique for identification of Russian-speaking users

Method

  • input: a FB user profile
  • output: is Russian-speaker (or a set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
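To show how a list of frequent character trigrams can separate languages that share the Cyrillic alphabet, here is a deliberately simplified Python sketch. It is not how the modules compared in this talk work internally; the function names, the profile size, and the overlap score are assumptions of the example.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trigrams are comparable."""
    return " ".join(text.lower().split())


def trigrams(text):
    text = normalize(text)
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_profile(sample, top=300):
    """Most frequent character trigrams of a language sample."""
    return {g for g, _ in Counter(trigrams(sample)).most_common(top)}


def overlap(text, profile):
    """Share of the text's trigrams that occur in the language profile."""
    grams = trigrams(text)
    if not grams:
        return 0.0
    return sum(g in profile for g in grams) / len(grams)


def detect_language(text, profiles):
    """Pick the language whose trigram profile overlaps most with the text."""
    return max(profiles, key=lambda lang: overlap(text, profiles[lang]))
```

In practice the profiles would be built from large corpora per language; with tiny samples the score is only indicative.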

Existing modules for language identification

  • langid.py: https://github.com/saffsd/langid.py
      • advanced n-gram selection
  • chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector
  • guess-language: https://code.google.com/p/guess-language
  • Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language
      • $20 per 1M characters
  • Yandex Translate API: http://api.yandex.ru/translate
      • free of charge: 1M characters per day (as of September 2013)
  • Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

  • Profile text: posts + comments + user names – Latin symbols
  • Profile text length: 3367 ± 17540
  • Russian-speakers: P(ru) > 0.95
  • Core Russian-speakers:
      • P(ru) > 0.95
      • Cyrillic symbols ≥ 20
      • locale is ru_RU
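The two user categories above amount to a small decision rule. The sketch below assumes that P(ru) comes from an external language detector and that the threshold of 20 is a count of Cyrillic characters; both the function names and these readings are assumptions, not details confirmed by the slides.

```python
def count_cyrillic(text):
    """Number of characters in the basic Cyrillic Unicode block."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04FF")


def classify_user(p_ru, profile_text, locale):
    """Return 'core', 'russian-speaking' or 'other' for one FB profile."""
    if p_ru <= 0.95:
        return "other"
    if count_cyrillic(profile_text) >= 20 and locale == "ru_RU":
        return "core"
    return "russian-speaking"
```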

Facebook Dataset: Results

  • 9,906,524 public FB profiles (≥ 50 Cyrillic characters)
  • 8,687,915 (88%) Russian-speaking users
  • 3,190,813 (32%) core Russian-speaking public Facebook users
  • 5,365,691 (54%) profiles with no profile text (≤ 200 characters)

5 User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

  • input: some SN data representing a user
  • output: list of user interests

Motivation

  • Advertisement targeting, user segmentation, etc.
  • Recommendations of content and friends
  • Customization of user experience

Data: FB and VK groups

Text corpus

  • 41 million VK groups
  • 11 million FB publics
  • 15 million FB groups

Data format

  • Title and/or description
  • List of members
  • Number of comments, likes, and posts by a member

Data: VK groups

253 interests detected by our system

Method

1 Create a text index of groups
2 Create a keyword list for each of 253 interests
3 KW classifier
  • Retrieve the top k groups for a set of interest keywords
  • Rank by TF-IDF
  • Associate the group's interests with its users
  • A group may have multiple interests
4 ML classifier
  • Use the top k groups as training data
  • BOW features
  • Keyword features
  • Linear models: L2 LR, Linear SVM, NB
  • Classify all groups
  • A group may have up to three top interests
  • Associate the group's interests with its users
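Step 3 (the keyword classifier) can be sketched as a TF-IDF ranking over group texts. This is a toy in-memory version written for illustration: the slides mention a text index but not its implementation, and the function names, whitespace tokenizer, and TF-IDF variant here are assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank_groups(groups, keywords, k=10):
    """Return ids of the top-k groups ranked by summed TF-IDF of the keywords.

    groups: dict mapping a group id to its title/description text.
    keywords: list of lowercase keywords describing one interest.
    """
    docs = {gid: Counter(tokenize(text)) for gid, text in groups.items()}
    n_docs = len(docs)
    # Document frequency: in how many groups each token occurs.
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    scores = {}
    for gid, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for kw in keywords:
            if counts[kw]:
                score += (counts[kw] / total) * math.log(n_docs / df[kw])
        if score > 0:
            scores[gid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The groups returned for an interest's keywords could then serve as training data for the ML classifier in step 4.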

Association of group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

  • $l$ – the number of post likes
  • $c_s$ – the number of short comments
  • $c_l$ – the number of long comments
  • $r$ – the number of reposts

The association score of a user and an interest depends on engagement in a group and on the number of groups:

$$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

  • $e_{vk}$, $e_{fb}$ – engagement in a VK/FB interest
  • $g_{vk}$, $g_{fb}$ – number of groups a user has in VK/FB
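The two formulas above translate directly into code. In the sketch below the weight values and the α, β defaults are placeholders chosen only to make the example concrete; the slides do not give the actual values.

```python
def engagement(likes, short_comments, long_comments, reposts,
               w_like=1.0, w_scomm=2.0, w_lcomm=3.0, w_repost=4.0):
    """Weighted activity of a user in groups of one interest category.

    The weights are hypothetical; only the linear form comes from the slide.
    """
    return (w_like * likes + w_scomm * short_comments
            + w_lcomm * long_comments + w_repost * reposts)


def association(e_fb, g_fb, e_vk, g_vk, alpha=0.5, beta=0.5):
    """Combined FB + VK association score of a user and an interest."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk
```

For example, 10 likes, 2 short comments and 1 long comment give an engagement of 17 under these placeholder weights.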

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • input: a user profile of one social network
  • output: profile of the same person in another social network
  • immediate applications in marketing, search, security, etc.

Contribution

  • a user identity resolution approach
  • precision of 0.98 and recall of 0.54
  • the method is computationally efficient and easily parallelizable

Dataset

                                 VK            Facebook
Number of users in our dataset   89,561,085    2,903,144
Number of users in Russia (1)    100,000,000   13,000,000
User overlap                     29%           88%

Training set: 92,488 matched FB-VK profiles

(1) According to comScore and http://vk.com/about

Profile matching algorithm

1 Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.

2 Candidate ranking. The candidates are ranked according to the similarity of their friends.

3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if the first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein automata for fuzzy matching between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms: "Alexander", "Sasha", "Sanya", "Sanek", etc.
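The similarity rule for candidate generation can be sketched as follows. Note the swap: this example uses a plain dynamic-programming edit distance and a linear scan instead of the Levenshtein automata mentioned above, so it shows the matching criterion, not the scalable implementation. It also compares single names rather than first and second names together. The function names and the synonym dictionary format are assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance of at most max_dist."""
    a, b = a.lower(), b.lower()
    return bool(a) and bool(b) and a[0] == b[0] and levenshtein(a, b) <= max_dist


def generate_candidates(vk_name, fb_names, synonyms=None):
    """FB names similar to the VK name or to one of its known synonyms."""
    synonyms = synonyms or {}
    variants = {vk_name} | set(synonyms.get(vk_name, ()))
    return [n for n in fb_names if any(names_similar(v, n) for v in variants)]
```

The synonym dictionary is what lets "Sasha" be retrieved for "Alexander" even though the first letters differ.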

Candidate ranking

  • The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles
  • Two friends are considered to be similar if:
      • the first two letters of their last names match
      • the similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$:

$$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

  • The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected name frequency:

$$sim_p(p_{vk}, p_{fb}) = \sum_{j:\ sim_s(s_i^f, s_j^f) > \alpha \,\wedge\, sim_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)$$

Here $s_i^f$ and $s_i^s$ are the first and second names of a VK profile, respectively, while $s_j^f$ and $s_j^s$ refer to a FB profile.
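A minimal Python sketch of the ranking score follows. Several points are interpretations rather than facts from the slides: $|s_j^f|$ and $|s_j^s|$ are read as frequency counts of the FB friend's first and second names, N as a normalizing constant, and each VK friend is counted at most once. The default thresholds are the α and β reported in the results of this talk.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n,
          alpha=0.8, beta=0.6):
    """Profile similarity from friend lists of (first, last) name pairs.

    first_freq / last_freq: hypothetical name -> frequency dictionaries.
    """
    score = 0.0
    for vk_first, vk_last in vk_friends:
        for fb_first, fb_last in fb_friends:
            # Cheap filter: the first two letters of the last names must match.
            if vk_last[:2].lower() != fb_last[:2].lower():
                continue
            if sim_s(vk_first, fb_first) > alpha and sim_s(vk_last, fb_last) > beta:
                expected = first_freq.get(fb_first, 1) * last_freq.get(fb_last, 1)
                score += min(1.0, n / expected)  # rare names weigh more
                break  # assumption: count each VK friend at most once
    return score
```

A shared friend with a common name therefore contributes less than a shared friend with a rare one.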

Best candidate selection

  • FB candidates are ranked according to their similarity $sim_p$ to an input profile $p_{vk}$
  • The best candidate $p_{fb}$ should pass two thresholds to match:
      • its score should be higher than the score threshold $\gamma$:

$$sim_p(p_{vk}, p_{fb}) > \gamma$$

      • it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$$
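The two-threshold selection rule can be written as a short function. The defaults γ = 3 and δ = 5 are the values reported in the results of this talk; the function name, the input format, and the handling of a zero score for the runner-up are assumptions of this sketch.

```python
def select_best(scored_candidates, gamma=3.0, delta=5.0):
    """Return the id of the matched FB profile, or None if no match.

    scored_candidates: list of (fb_profile_id, sim_p score) pairs.
    """
    if not scored_candidates:
        return None
    ranked = sorted(scored_candidates, key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score <= gamma:
        return None  # fails the score threshold
    if len(ranked) == 1:
        return best_id  # the only candidate
    second_score = ranked[1][1]
    # Assumption: a runner-up with zero score never blocks the match.
    if second_score <= 0 or best_score / second_score > delta:
        return best_id
    return None  # the best candidate is not clearly ahead
```

Raising either threshold trades recall for precision, which is what the precision-recall plot of the method explores.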

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α       0.8
Second name threshold β      0.6
Profile score threshold γ    3
Profile ratio threshold δ    5
Number of matched profiles   644,334 (22%)
Expected precision           0.98
Expected recall              0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection

  • "Tell me who your friends are and I will tell you who you are"
  • Most frequent age/region of friends
  • Reject users with high variation of age/region among friends
  • Up to 85-90% precision

User Income Detection

  • Transfer learning: the target variable is not present in SNs
  • Training a model on a set of users with known income
  • Applying the model to the social network profiles
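The age/region heuristic from this slide (most frequent value among friends, with rejection of inconsistent cases) can be sketched as below. Measuring "high variation" as a low share of the most frequent value, and the 0.5 cut-off, are assumptions of this example; the slides do not specify the rejection criterion.

```python
from collections import Counter


def infer_from_friends(friend_values, min_share=0.5):
    """Most frequent age/region among friends, or None if too inconsistent.

    friend_values: attribute values of a user's friends (None if unknown).
    min_share: hypothetical minimum share of the most frequent value.
    """
    known = [v for v in friend_values if v is not None]
    if not known:
        return None
    value, count = Counter(known).most_common(1)[0]
    return value if count / len(known) >= min_share else None
```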

Thank you! Questions?

References

Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In ICWSM.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., pages 18–21.

Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers' age and gender. In Third International AAAI Conference on Weblogs and Social Media.

Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Workshop.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics.

Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.

Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.


Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Andrey Teterin

Detect gender of a userto profile a useruser segmentation is helpful in search advertisement etc

By text written by a userCiot et al [2013] Koppel et al [2002] Goswami et al [2009]Mukherjee and Liu [2010] Peersman et al [2011] Rao et al[2010] Rangel and Rosso Rangel and Rosso [2013] Al Zamalet alAl Zamal et al [2012] and Lui et al Liu et al [2012]

By full name Burger et al [2011] Panchenko and Teterin[2014]

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Online demo

httpresearchdigsolabcomgender

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Training Data

100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Training Data

Figure Name-surname co-occurrences rows and columns are sorted byfrequency

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Character endings of Russian names

72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings

Conclusion

Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Character endings of Russian names

Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin

Table Most discriminative and frequent two character endings ofRussian names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Gender Detection Method

input a string representing a name of a personoutput gender (male or female)binary classification task

Features

endingscharacter n-gramsdictionary of malefemale names and surnames

Model

L2-regularized Logistic Regression

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Features

Character n-grams

  • males: Alexander Yaroskavski, Oleg Arbuzov
  • females: Alexandra Yaroskavskaya, Nataliya Arbuzova
  • BUT: “Sidorenko”, “Moroz” or “Bondar”
  • two most common one-character endings: “a” and “ya” (“я”)

Dictionaries of first and last names

  • probability that a name belongs to the male gender
  • P(c = male | firstname), P(c = male | lastname)
  • 3427 first names, 11411 last names
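The three feature groups can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the boundary markers ^ and $, the relative-frequency estimate of P(c = male | name) and the 0.5 default for unseen names are all hypothetical choices, not details taken from the slides.

```python
from collections import Counter

FEMALE_ENDINGS = ("a", "ya")  # the two one-character endings named on the slide


def char_ngrams(text, n=3):
    """Character n-grams of a lowercased string, with ^ and $ as boundary marks."""
    padded = "^" + text.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_dictionary(labelled_names):
    """Estimate P(c = male | name) from (name, gender) pairs by relative frequency."""
    male, total = Counter(), Counter()
    for name, gender in labelled_names:
        key = name.lower()
        total[key] += 1
        if gender == "male":
            male[key] += 1
    return {key: male[key] / total[key] for key in total}


def extract_features(first, last, first_dict, last_dict, unknown=0.5):
    """Sparse feature dict for one full name; `unknown` is used for unseen names."""
    feats = {}
    for gram in char_ngrams(first + " " + last):
        feats["3gram=" + gram] = 1.0
    for ending in FEMALE_ENDINGS:
        feats["first_ends_" + ending] = float(first.lower().endswith(ending))
        feats["last_ends_" + ending] = float(last.lower().endswith(ending))
    feats["p_male_first"] = first_dict.get(first.lower(), unknown)
    feats["p_male_last"] = last_dict.get(last.lower(), unknown)
    return feats
```

A feature dictionary of this shape can be fed to any linear classifier after vectorization; the dictionary probabilities act as two dense features next to the sparse n-gram indicators.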


Results

Model                   Accuracy        Precision       Recall          F-measure
rule-based baseline     0.638           0.995           0.633           0.774
endings                 0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                 0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                   0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams         0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts           0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts   0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10000 names. Here endings – 4 Russian female endings, trigrams – 1000 most frequent 3-grams, dictionary – name/surname dictionary. This table presents precision, recall and F-measure of the female class.
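The F-measure column can be checked against the precision and recall columns, assuming it is the usual balanced F1 score (the harmonic mean of precision and recall). The function name below is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows of the table above: (precision, recall) of the female class.
baseline = f_measure(0.995, 0.633)
endings = f_measure(0.921, 0.784)
dicts = f_measure(0.992, 0.925)
```

Rounded to three digits, these reproduce the reported 0.774, 0.847 and 0.957, which also shows why the rule-based baseline scores low despite its high precision: its recall is only 0.633.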


Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10000 names.

4 User Language Detection

Problem

Motivation

  • Goal: to detect Russian-speaking users
  • The Cyrillic alphabet is also used by Ukrainian, Belorussian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

  • Which method is the best for the Russian language?
  • How to adapt it to the FB profile?

Contributions

  • Comparison of Russian-enabled language detection modules
  • A technique for identification of Russian-speaking users

Method

  • input: a FB user profile
  • output: whether the user is a Russian speaker (or the set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка ать ото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те ели ере кот льн ник нти о с
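To make the role of character trigrams concrete, the sketch below scores a text by the share of its trigrams that belong to a reference set. It is a deliberately simplified illustration, not the method of any module discussed in this talk: the reference set contains only the letter-only trigrams from the list above (trigrams containing spaces cannot be recovered from the transcript), and the function names are hypothetical.

```python
# Letter-only trigrams taken from the list above.
RU_TRIGRAMS = {
    "про", "ать", "ото", "ова", "тел", "тор", "сти", "стр", "ват", "иче", "сто",
    "вер", "ист", "ост", "тра", "ели", "ере", "кот", "льн", "ник", "нти",
}


def trigrams(text):
    """All overlapping character trigrams of the lowercased text."""
    lowered = text.lower()
    return [lowered[i:i + 3] for i in range(len(lowered) - 2)]


def russian_trigram_share(text):
    """Fraction of the text's trigrams found in the reference set (0.0 if none)."""
    grams = trigrams(text)
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in RU_TRIGRAMS)
    return hits / len(grams)
```

A score like this cannot separate Russian from other Cyrillic-script languages reliably, which is why trained detectors compare trigram statistics across many languages instead of using a single fixed list.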


Existing modules for language identification

  • langid.py: https://github.com/saffsd/langid.py (advanced n-gram selection)
  • chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector
  • guess-language: https://code.google.com/p/guess-language
  • Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language ($20 per 1M characters)
  • Yandex Translate API: http://api.yandex.ru/translate/ (free of charge, 1M characters per day, as of September 2013)
  • Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

  • Profile text: posts + comments + user names – Latin symbols
  • Profile text length: 3367 ± 17540
  • Russian-speakers: P(ru) > 0.95
  • Core Russian-speakers:
      • P(ru) > 0.95
      • Cyrillic symbols >= 20
      • locale is ru_RU
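The two user categories can be expressed as a small decision rule. The sketch below assumes that P(ru) comes from some external language detector and is passed in as a number, and that the Cyrillic threshold of 20 is a character count (the transcript does not make the unit clear); the function names and category labels are hypothetical.

```python
import re

# Basic Cyrillic Unicode block (U+0400 to U+04FF).
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def count_cyrillic(text):
    """Number of characters from the basic Cyrillic block."""
    return len(CYRILLIC.findall(text))


def speaker_category(p_ru, profile_text, locale):
    """Apply the thresholds from the slide to one profile.

    p_ru is the detector's probability that the profile text is Russian.
    """
    if p_ru <= 0.95:
        return "other"
    if count_cyrillic(profile_text) >= 20 and locale == "ru_RU":
        return "core Russian-speaker"
    return "Russian-speaker"
```

Every core Russian-speaker is also a Russian-speaker under this rule; the extra conditions only narrow the set to profiles with enough Cyrillic text and a Russian interface locale.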


Facebook Dataset: Results

  • 9906524 public FB profiles (>= 50 Cyrillic characters)
  • 8687915 (88%) Russian-speaking users
  • 3190813 (32%) core Russian-speaking public Facebook users
  • 5365691 (54%) profiles with no profile text (<= 200 characters)

5 User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

  • input: some SN data representing a user
  • output: list of user interests

Motivation

  • Advertisement targeting, user segmentation, etc.
  • Recommendations of content and friends
  • Customization of user experience

Data: FB and VK groups

Text corpus

  • 41 million VK groups
  • 11 million FB publics
  • 15 million FB groups

Data format

  • Title and/or description
  • List of members
  • Number of comments, likes, posts by a member

Data: VK groups

253 interests detected by our system

Method

  1. Create a text index of groups
  2. Create a keyword list for each of the 253 interests
  3. KW classifier
      • Retrieve the top k groups matching the set of interest keywords
      • Rank by TF-IDF
      • Associate the group's interests with its users
      • A group may have multiple interests
  4. ML classifier
      • Use the top k groups as training data
      • BOW features
      • Keyword features
      • Linear models: L2 LR, Linear SVM, NB
      • Classify all groups
      • A group may have up to three top interests
      • Associate the group's interests with its users
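The keyword (KW) classifier step can be sketched with a toy in-memory index. Everything here is an assumption made for illustration: whitespace tokenization, the plain tf · log(N/df) weighting, and the function names. A production system over millions of groups would use a real search index instead.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a simplification)."""
    return text.lower().split()


def build_index(groups):
    """groups: dict group_id -> title/description. Returns (term counts, idf)."""
    tf = {gid: Counter(tokenize(text)) for gid, text in groups.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    n = len(groups)
    idf = {term: math.log(n / df[term]) for term in df}
    return tf, idf


def top_k_groups(keywords, tf, idf, k=2):
    """Rank groups by the summed TF-IDF of the interest keywords; keep the top k."""
    scores = {}
    for gid, counts in tf.items():
        score = sum(counts[kw] * idf.get(kw, 0.0) for kw in keywords)
        if score > 0:  # groups without any informative keyword are skipped
            scores[gid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The groups returned for an interest's keyword list can then serve both as direct labels (KW classifier) and as training examples for the ML classifier in step 4.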


Association of group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$

  • $l$ – the number of post likes
  • $c_s$ – the number of short comments
  • $c_l$ – the number of long comments
  • $r$ – the number of reposts

The association score of a user and an interest depends on the engagement in a group and on the number of groups:

$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$

  • $e_{vk}$, $e_{fb}$ – engagement in a VK/FB interest
  • $g_{vk}$, $g_{fb}$ – number of groups a user has in VK/FB
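The two scores are simple weighted sums, as the sketch below shows. The numeric weights and the default values of α and β are placeholders chosen for the example; the slides do not give the actual values, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    likes: int           # l: number of post likes
    short_comments: int  # c_s: number of short comments
    long_comments: int   # c_l: number of long comments
    reposts: int         # r: number of reposts


# Placeholder weights (assumption): costlier actions count for more.
WEIGHTS = {"like": 1.0, "short_comment": 2.0, "long_comment": 3.0, "repost": 4.0}


def engagement(activity, weights=WEIGHTS):
    """e = w_like * l + w_scomm * c_s + w_lcomm * c_l + w_repost * r"""
    return (
        weights["like"] * activity.likes
        + weights["short_comment"] * activity.short_comments
        + weights["long_comment"] * activity.long_comments
        + weights["repost"] * activity.reposts
    )


def association_score(e_fb, g_fb, e_vk, g_vk, alpha=1.0, beta=1.0):
    """all = alpha * e_fb * g_fb + beta * e_vk * g_vk"""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk
```

For example, 3 likes, 2 short comments, 1 long comment and 1 repost give an engagement of 14 under these placeholder weights.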


Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • input: a user profile in one social network
  • output: the profile of the same person in another social network
  • immediate applications in marketing, search, security, etc.

Contribution

  • a user identity resolution approach
  • precision of 0.98 and recall of 0.54
  • the method is computationally efficient and easily parallelizable

Dataset

                                  VK           Facebook
Number of users in our dataset    89561085     2903144
Number of users in Russia [1]     100000000    13000000
User overlap                      29%          88%

training set: 92488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Profile matching algorithm

  1. Candidate generation. For each VK profile, we retrieve a set of FB profiles with similar first and second names.
  2. Candidate ranking. The candidates are ranked according to the similarity of their friends.
  3. Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if the first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein automata for fuzzy matching between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms
      • “Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
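The name-similarity rule can be sketched as below. Note the swap: this example uses the plain dynamic-programming edit distance and a linear scan, not Levenshtein automata, so it illustrates the matching criterion rather than the scalable lookup. The synonym dictionary contains only the example from the slide, and all function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Toy synonym dictionary built from the example on the slide.
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(name):
    """Lowercase the name and map known short forms to the full form."""
    lowered = name.lower()
    return NAME_SYNONYMS.get(lowered, lowered)


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance of at most max_dist."""
    a, b = canonical(a), canonical(b)
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_dist


def generate_candidates(vk_name, fb_names):
    """Linear scan over FB names; an automaton-based index would replace this."""
    return [name for name in fb_names if names_similar(vk_name, name)]
```

The rule is intentionally permissive ("Oleg" and "Olga" are within distance 2), which is acceptable at this stage because the friend-based ranking that follows filters the candidates.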


Candidate ranking

  • The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles.
  • Two friends are considered to be similar if:
      • the first two letters of their last names match;
      • the similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$, respectively:

$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$

  • The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected name frequency:

$sim_p(p_{vk}, p_{fb}) = \sum_{j:\ sim_s(s_i^f, s_j^f) > \alpha \ \wedge\ sim_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)$

Here $s_i^f$ and $s_i^s$ are the first and second names of a VK profile, respectively, while $s_j^f$ and $s_j^s$ refer to a FB profile.
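A possible reading of the ranking score is sketched below. Several points are assumptions rather than facts from the slides: inside the sum, |s_j^f| and |s_j^s| are interpreted as corpus frequencies of the first and second name (following the "inverse of the expected name frequency" remark), N is treated as a normalization constant passed in by the caller, and each FB friend contributes at most once. The default thresholds 0.8 and 0.6 are the values of α and β reported in the results of this talk.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def friends_match(vk_friend, fb_friend, alpha, beta):
    """Two-letter last-name prefix check followed by both similarity thresholds."""
    (vk_first, vk_second), (fb_first, fb_second) = vk_friend, fb_friend
    if vk_second[:2] != fb_second[:2]:
        return False
    return sim_s(vk_first, fb_first) > alpha and sim_s(vk_second, fb_second) > beta


def sim_p(vk_friends, fb_friends, first_freq, second_freq, n, alpha=0.8, beta=0.6):
    """Sum of min(1, N / (freq(first) * freq(second))) over matched FB friends.

    Friends are (first_name, second_name) tuples in lowercase; first_freq and
    second_freq map names to their frequencies (assumed interpretation).
    """
    score = 0.0
    for fb_friend in fb_friends:
        if any(friends_match(vk, fb_friend, alpha, beta) for vk in vk_friends):
            fb_first, fb_second = fb_friend
            score += min(1.0, n / (first_freq[fb_first] * second_freq[fb_second]))
    return score
```

Under this reading, a shared friend with a rare full name contributes up to 1 to the score, while a shared friend with a very common name contributes close to 0.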


Best candidate selection

  • FB candidates are ranked according to their similarity $sim_p$ to an input profile $p_{vk}$.
  • The best candidate $p_{fb}$ should pass two thresholds to match:
      • its score should be higher than the score threshold $\gamma$:

        $sim_p(p_{vk}, p_{fb}) > \gamma$

      • it is either the only candidate, or the ratio between its score and the score of the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

        $\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$
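The two-threshold selection can be sketched as a short function. The defaults 3 and 5 are the values of γ and δ reported in the results of this talk; the function name, the dictionary input format and the handling of a zero-score runner-up are assumptions made for the example.

```python
def select_best(scores, gamma=3.0, delta=5.0):
    """scores: dict fb_profile_id -> sim_p score. Returns the matched id or None."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score <= gamma:
        return None  # fails the score threshold
    if len(ranked) == 1:
        return best_id  # the only candidate
    runner_up = ranked[1][1]
    # Assumption: a runner-up with zero score cannot challenge the best candidate.
    if runner_up == 0 or best_score / runner_up > delta:
        return best_id
    return None  # the best candidate is not separated well enough
```

Raising either threshold trades recall for precision: fewer VK profiles receive a match, but the matches that remain are less ambiguous.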


Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α         0.8
Second name threshold β        0.6
Profile score threshold γ      3
Profile ratio threshold δ      5
Number of matched profiles     644334 (22%)
Expected precision             0.98
Expected recall                0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection

  • Tell me who your friends are, and I will tell you who you are
  • Most frequent age/region of friends
  • Reject users with high variation of age/region among friends
  • Up to 85-90% precision

User Income Detection

  • Transfer learning: the target variable is not present in SNs
  • Training a model on a set of users with known income
  • Applying the model to the social network profiles
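The friend-based age heuristic described above can be sketched in a few lines. The standard-deviation threshold and the minimum number of friends are hypothetical parameters picked for the example; the slides only say that users with a high variation among friends are rejected.

```python
from collections import Counter
from statistics import pstdev


def predict_age(friend_ages, max_std=5.0, min_friends=3):
    """Most frequent age among friends, or None when the evidence is too noisy."""
    if len(friend_ages) < min_friends:
        return None  # too few friends to say anything
    if pstdev(friend_ages) > max_std:
        return None  # friends' ages vary too much: reject the user
    return Counter(friend_ages).most_common(1)[0][0]
```

The same pattern applies to region detection by replacing the spread check with, for example, a minimum share of friends living in the most frequent region.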


Thank you! Questions?

References

Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In ICWSM.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., pages 18–21.

Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers' age and gender. In Third International AAAI Conference on Weblogs and Social Media.

Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Workshop.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics.

Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.

Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.

Page 12: Text Analysis of Social Networks: Working with FB and VK Data

Problem

Joint work with Andrey Teterin

Detect gender of a user

  • to profile a user
  • user segmentation is helpful in search, advertisement, etc.

By text written by a user: Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009], Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al. [2010], Rangel and Rosso [2013], Al Zamal et al. [2012], and Liu et al. [2012]

By full name: Burger et al. [2011], Panchenko and Teterin [2014]

Online demo

http://research.digsolab.com/gender

Training Data

  • 100000 full names of Facebook users with known gender
  • full name – first and last name of a user
  • gender: male or female
  • names written in both Cyrillic and Latin alphabets
  • “Alexander Ivanov”, “Masha Sidorova”, “Pavel Nikolenko”, etc.

Training Data

Figure: Name-surname co-occurrences; rows and columns are sorted by frequency.


Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop

Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics

Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM

Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM

Alexander Panchenko Text Analysis of Social Networks

  • Social Network Data
  • Social Network Analysis
  • User Gender Detection
  • User Language Detection
  • User Interests Detection
  • VK-FB User Matching
  • Other SNA Tasks
Page 13: Text Analysis of Social Networks: Working with FB and VK Data

Online demo

http://research.digsolab.com/gender

Training Data

  • 100,000 full names of Facebook users with known gender
  • full name: first and last name of a user
  • gender: male or female
  • names written in both Cyrillic and Latin alphabets
  • “Alexander Ivanov”, “Masha Sidorova”, “Pavel Nikolenko”, etc.

Training Data

Figure: Name-surname co-occurrences; rows and columns are sorted by frequency.

Character endings of Russian names

  • 72% of first names have a typical male/female ending
  • 68% of surnames have a typical male/female ending
  • a typical male/female ending splits males from females with an error of less than 5%
  • gender of ≥ 50% of first names is recognized with 8 endings
  • gender of ≥ 50% of second names is recognized with 5 endings

Conclusion

A simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach.

Character endings of Russian names

Type          Ending     Gender   Error, %   Example
first name    na (на)    female   0.27       Ekaterina
first name    iya (ия)   female   0.32       Anastasiya
first name    ei (ей)    male     0.16       Sergei
first name    dr (др)    male     0.00       Alexandr
first name    ga (га)    male     4.94       Serega
first name    an (ан)    male     4.99       Ivan
first name    la (ла)    female   4.23       Luidmila
first name    ii (ий)    male     0.34       Yurii
second name   va (ва)    female   0.28       Morozova
second name   ov (ов)    male     0.21       Objedkov
second name   na (на)    female   2.22       Matyushina
second name   ev (ев)    male     0.44       Sergeev
second name   in (ин)    male     1.94       Teterin

Table: Most discriminative and frequent two-character endings of Russian names.

Gender Detection Method

  • input: a string representing a name of a person
  • output: gender (male or female)
  • binary classification task

Features

  • endings
  • character n-grams
  • dictionary of male/female names and surnames

Model

  • L2-regularized Logistic Regression
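The ending and character n-gram features could be extracted roughly as in the sketch below. This is only an illustration, not the authors' implementation: the function name `name_features`, the feature-key prefixes and the boundary markers `^` and `$` are assumptions made here. The resulting sparse dictionary could then be passed to any L2-regularized logistic regression implementation.

```python
def name_features(full_name, n=3, ending_len=2):
    """Return a sparse feature dict for one full name (hypothetical helper)."""
    features = {}
    for token in full_name.lower().split():
        # Ending feature: the last `ending_len` characters of each name part.
        features["end=" + token[-ending_len:]] = 1.0
        # Character n-grams; boundary markers (an assumption of this sketch)
        # let word-final n-grams differ from word-internal ones.
        padded = "^" + token + "$"
        for i in range(len(padded) - n + 1):
            key = "ng=" + padded[i:i + n]
            features[key] = features.get(key, 0.0) + 1.0
    return features


feats = name_features("Masha Sidorova")
print(sorted(feats))
```

Dictionary features (the probability that a first or last name is male) would be added to the same feature dictionary as real-valued entries.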

Features

Character n-grams

  • males: Alexander Yaroskavski, Oleg Arbuzov
  • females: Alexandra Yaroskavskaya, Nayaliya Arbuzova
  • BUT: “Sidorenko”, “Moroz” or “Bondar”
  • two most common one-character endings: “a” and “ya” (“я”)

Dictionaries of first and last names

  • probability that a name belongs to the male gender: P(c = male | firstname), P(c = male | lastname)
  • 3,427 first names, 11,411 last names
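A dictionary feature such as P(c = male | firstname) can be estimated as a relative frequency over labeled names. The sketch below assumes a plain maximum-likelihood estimate without smoothing; the function name `male_probabilities` and the input format are hypothetical.

```python
from collections import defaultdict


def male_probabilities(labeled_names):
    """Estimate P(c = male | name) from (name, gender) pairs."""
    male = defaultdict(int)
    total = defaultdict(int)
    for name, gender in labeled_names:
        key = name.lower()  # case-insensitive dictionary (an assumption)
        total[key] += 1
        if gender == "male":
            male[key] += 1
    # Relative frequency of the male label for every observed name.
    return {name: male[name] / total[name] for name in total}


sample = [("Sasha", "male"), ("Sasha", "female"), ("Ivan", "male"), ("sasha", "male")]
probs = male_probabilities(sample)
print(probs["ivan"], round(probs["sasha"], 3))
```

The same function can be applied separately to first names and to last names to obtain the two dictionaries.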

Results

Model                   Accuracy        Precision       Recall          F-measure
rule-based baseline     0.638           0.995           0.633           0.774
endings                 0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                 0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                   0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams         0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts           0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts   0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here endings: 4 Russian female endings; trigrams: 1,000 most frequent 3-grams; dictionary: name/surname dict. This table presents precision, recall and F-measure of the female class.

Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.

4 User Language Detection

Problem

Motivation

  • Goal: to detect Russian-speaking users
  • The Cyrillic alphabet is also used by Ukrainian, Belorussian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

  • Which method is the best for the Russian language?
  • How to adapt it to the FB profile?

Contributions

  • Comparison of Russian-enabled language detection modules
  • A technique for identification of Russian-speaking users

Method

  • input: a FB user profile
  • output: is Russian-speaker (or a set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с

Existing modules for language identification

  • langid.py: https://github.com/saffsd/langid.py
      • Advanced n-gram selection
  • chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector/
  • guess-language: https://code.google.com/p/guess-language/
  • Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language
      • $20 per 1M characters
  • Yandex Translate API: http://api.yandex.ru/translate/
      • Free of charge: 1M characters per day (as of September 2013)
  • Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

  • Profile text: posts + comments + user names - Latin symbols
  • Profile text length: 3367 ± 17540
  • Russian-speakers: P(ru) > 0.95
  • Core Russian-speakers:
      • P(ru) > 0.95
      • Cyrillic symbols >= 20
      • locale is ru_RU
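The two-level rule above can be written as a small predicate. In this sketch the probability P(ru) is assumed to come from an external language detector, the three "core" conditions are assumed to be combined with a logical AND, and the threshold 20 is assumed to be a count of Cyrillic characters (the slide does not state the unit). The names `count_cyrillic` and `classify_profile` are hypothetical.

```python
def count_cyrillic(text):
    """Count characters from the basic Cyrillic Unicode block (U+0400..U+04FF)."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")


def classify_profile(p_ru, profile_text, locale, p_min=0.95, cyr_min=20):
    """Return 'core', 'russian-speaking' or 'other' for one profile."""
    if p_ru <= p_min:
        return "other"
    # Assumption: all three conditions must hold for a core Russian-speaker.
    if count_cyrillic(profile_text) >= cyr_min and locale == "ru_RU":
        return "core"
    return "russian-speaking"


print(classify_profile(0.99, "Привет, как дела? Всё хорошо!", "ru_RU"))
```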

Facebook Dataset: Results

  • 9,906,524 public FB profiles (>= 50 Cyrillic characters)
  • 8,687,915 (88%) Russian-speaking users
  • 3,190,813 (32%) core Russian-speaking public Facebook users
  • 5,365,691 (54%) profiles with no profile text (<= 200 characters)

5 User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

  • input: some SN data representing a user
  • output: list of user interests

Motivation

  • Advertisement targeting, user segmentation, etc.
  • Recommendations of content and friends
  • Customization of user experience

Data: FB and VK groups

Text corpus

  • 41 million VK groups
  • 11 million FB publics
  • 15 million FB groups

Data format

  • Title and/or description
  • List of members
  • Number of comments, likes, posts by a member

Data: VK groups

253 interests detected by our system

Method

1 Create a text index of groups
2 Create a keyword list for each of 253 interests
3 KW classifier
  • Retrieve top k groups by a set of interest keywords
  • Rank by TF-IDF
  • Associate group's interests with its users
  • A group may have multiple interests
4 ML classifier
  • Use top k groups as training data
  • BOW features
  • Keyword features
  • Linear models: L2 LR, Linear SVM, NB
  • Classify all groups
  • A group may have up to three top interests
  • Associate group's interests with its users
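The keyword-based retrieval step (top k groups ranked by TF-IDF) can be sketched as follows. This is a minimal in-memory illustration under several assumptions: whitespace tokenization, a basic tf · log(N/df) weighting, and already lowercased keywords. A real system would use a proper text index; the function name `rank_groups` is hypothetical.

```python
import math
from collections import Counter


def rank_groups(groups, keywords, k=3):
    """groups: dict group_id -> text. Returns the top-k (group_id, score) pairs."""
    docs = {gid: Counter(text.lower().split()) for gid, text in groups.items()}
    n_docs = len(docs)
    # Document frequency: in how many groups each term occurs.
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    scores = {}
    for gid, counts in docs.items():
        length = sum(counts.values())
        score = 0.0
        for kw in keywords:
            if counts[kw]:
                tf = counts[kw] / length
                idf = math.log(n_docs / df[kw])
                score += tf * idf
        if score > 0:
            scores[gid] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


groups = {
    "g1": "football club fans football",
    "g2": "cooking recipes",
    "g3": "football and cooking",
}
print(rank_groups(groups, ["football"], k=2))
```

The retrieved groups can then serve as training data for the ML classifier step.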

Association of group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

  • $l$: the number of post likes
  • $c_s$: the number of short comments
  • $c_l$: the number of long comments
  • $r$: the number of reposts

Association score of a user and an interest depends on engagement in a group and on the number of groups:

$$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

  • $e_{vk}$, $e_{fb}$: engagement in a VK/FB interest
  • $g_{vk}$, $g_{fb}$: number of groups a user has in VK/FB
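Both scores are weighted sums and can be computed directly. The weight values used as defaults below are placeholders chosen for illustration only; the slides do not give the actual weights or the values of α and β.

```python
def engagement(likes, short_comments, long_comments, reposts,
               w_like=1.0, w_scomm=2.0, w_lcomm=3.0, w_repost=4.0):
    """Engagement e of a user in one interest category (placeholder weights)."""
    return (w_like * likes + w_scomm * short_comments
            + w_lcomm * long_comments + w_repost * reposts)


def association(e_fb, g_fb, e_vk, g_vk, alpha=1.0, beta=1.0):
    """Overall user-interest score combining FB and VK (placeholder alpha, beta)."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk


e_fb = engagement(likes=3, short_comments=1, long_comments=0, reposts=2)
print(e_fb, association(e_fb, g_fb=2, e_vk=5.0, g_vk=1, alpha=0.5, beta=1.0))
```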

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • input: a user profile of one social network
  • output: profile of the same person in another social network
  • immediate applications in marketing, search, security, etc.

Contribution

  • user identity resolution approach
  • precision of 0.98 and recall of 0.54
  • the method is computationally efficient and easily parallelizable

Dataset

                                 VK            Facebook
Number of users in our dataset   89,561,085    2,903,144
Number of users in Russia¹       100,000,000   13,000,000
User overlap                     29%           88%

Training set: 92,488 matched FB-VK profiles

¹ According to comScore and http://vk.com/about

Profile matching algorithm

1 Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.
2 Candidate ranking. The candidates are ranked according to similarity of their friends.
3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if the first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein Automata for fuzzy match between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms
      • “Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
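The name similarity rule (same first letter, edit distance ≤ 2) can be illustrated with a brute-force scan. Note that this sketch swaps in a plain dynamic-programming Levenshtein distance instead of the Levenshtein automata mentioned above, so it shows the matching criterion rather than the scalable lookup. The function names and the synonym dictionary format are assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance of at most `max_dist`."""
    a, b = a.lower(), b.lower()
    return bool(a) and bool(b) and a[0] == b[0] and levenshtein(a, b) <= max_dist


def generate_candidates(vk_name, fb_names, synonyms=None):
    """Return FB names similar to the VK name or to one of its synonyms."""
    variants = {vk_name} | set((synonyms or {}).get(vk_name, ()))
    return [fb for fb in fb_names if any(names_similar(v, fb) for v in variants)]


synonyms = {"Alexander": ["Sasha", "Sanya", "Sanek"]}
print(generate_candidates("Alexander", ["Aleksander", "Sasha", "Oleg"], synonyms))
```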

Candidate ranking

  • The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles
  • Two friends are considered to be similar if:
      • First two letters of their last names match
      • Similarities $sim_s$ between first/last names are greater than thresholds $\alpha$, $\beta$:

$$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

  • Contribution of each friend to similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is inverse of name expectation frequency:

$$sim_p(p_{vk}, p_{fb}) = \sum_{j:\ sim_s(s_i^f, s_j^f) > \alpha\ \wedge\ sim_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)$$

Here $s_i^f$ and $s_i^s$ are first and second names of a VK profile, correspondingly, while $s_j^f$ and $s_j^s$ refer to a FB profile.
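The ranking score can be sketched in a few lines. Several points here are assumptions rather than facts from the slides: the terms |s_j^f| and |s_j^s| inside the sum are interpreted as dataset frequencies of the friend's first and last names, N is treated as a normalization constant supplied by the caller, and friends are compared by an exhaustive scan. The default thresholds 0.8 and 0.6 are the values of α and β reported in the results.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def friends_match(vk, fb, alpha=0.8, beta=0.6):
    """vk, fb: (first_name, last_name) tuples of two friends."""
    (vf, vs), (ff, fs) = vk, fb
    return vs[:2] == fs[:2] and sim_s(vf, ff) > alpha and sim_s(vs, fs) > beta


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n_norm,
          alpha=0.8, beta=0.6):
    """Profile similarity: rare matching friend names contribute more."""
    score = 0.0
    for fb in fb_friends:
        if any(friends_match(vk, fb, alpha, beta) for vk in vk_friends):
            # Assumption: the denominator is the product of name frequencies.
            score += min(1.0, n_norm / (first_freq[fb[0]] * last_freq[fb[1]]))
    return score


vk = [("ivan", "petrov"), ("oleg", "sidorov")]
fb = [("ivan", "petrov"), ("anna", "smirnova")]
print(sim_p(vk, fb, {"ivan": 100, "anna": 50}, {"petrov": 10, "smirnova": 5}, n_norm=500))
```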

Best candidate selection

  • FB candidates are ranked according to similarity $sim_p$ to an input profile $p_{vk}$
  • The best candidate $p_{fb}$ should pass two thresholds to match:
      • its score should be higher than the score threshold $\gamma$:

$$sim_p(p_{vk}, p_{fb}) > \gamma$$

      • it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$$
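The two-threshold selection rule can be expressed as a short function. The defaults γ = 3 and δ = 5 are the values reported in the results; the function name `select_best`, the input format, and the handling of a zero runner-up score are assumptions of this sketch.

```python
def select_best(candidates, gamma=3.0, delta=5.0):
    """candidates: list of (profile_id, score). Returns a profile id or None."""
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_id, best_score = ranked[0]
    # Threshold 1: the absolute score must exceed gamma.
    if best_score <= gamma:
        return None
    # A single candidate only needs to pass the score threshold.
    if len(ranked) == 1:
        return best_id
    runner_up = ranked[1][1]
    # Threshold 2: the best score must dominate the runner-up by a factor delta.
    # Assumption: a non-positive runner-up score counts as an unbounded ratio.
    if runner_up <= 0 or best_score / runner_up > delta:
        return best_id
    return None


print(select_best([("fb_1", 10.0), ("fb_2", 1.0)]))
print(select_best([("fb_1", 10.0), ("fb_2", 4.0)]))
```

Raising γ or δ trades recall for precision, since fewer VK profiles receive a match.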

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at given recall.

Results: matching VK and FB profiles

First name threshold α       0.8
Second name threshold β      0.6
Profile score threshold γ    3
Profile ratio threshold δ    5
Number of matched profiles   644,334 (22%)
Expected precision           0.98
Expected recall              0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

  • User Age & Region Detection
      • Tell me who your friends are and I will tell you who you are
      • Most frequent age/region of friends
      • Reject users with high variation of age/region among friends
      • Up to 85-90% precision
  • User Income Detection
      • Transfer learning: target variable is not present in SNs
      • Training a model on a set of users with known income
      • Applying the model to the social network profiles

Thank you! Questions?


  • Social Network Data
  • Social Network Analysis
  • User Gender Detection
  • User Language Detection
  • User Interests Detection
  • VK-FB User Matching
  • Other SNA Tasks
Page 14: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Training Data

100000 full names of Facebook users with known genderfull name ndash first and last name of a usergender male or femalenames written in both Cyrillic and Latin alphabetsldquoAlexander Ivanovrdquo ldquoMasha Sidorovardquo ldquoPavel Nikolenkordquo etc

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Training Data

Figure Name-surname co-occurrences rows and columns are sorted byfrequency

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Character endings of Russian names

72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings

Conclusion

Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Character endings of Russian names

Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin

Table Most discriminative and frequent two character endings ofRussian names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Gender Detection Method

input a string representing a name of a personoutput gender (male or female)binary classification task

Features

endingscharacter n-gramsdictionary of malefemale names and surnames

Model

L2-regularized Logistic Regression

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Features

Character n-grams

males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)

Dictionaries of first and last names

probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002

Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Motivation

Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc

Research Questions

Which method is the best for Russian languageHow to adopt it to the FB profile

Contributions

Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Existing modules for language identification

langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection

chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector

guess-languagehttpscodegooglecompguess-language

Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters

Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)

Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

DBpedia Dataset

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Accuracy of Different Language Detection Modules

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Method

Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers

P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Results

9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

input some SN data representing a useroutput list of user interests

Motivation

Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data FB and VK groups

Text corpus

41 million of VK groups11 million of FB publics15 million of FB groups

Data format

Title andor descriptionList of membersNumber of comments likes posts by a member

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data VK groups

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

253 interests detected by our system

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier

Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests

4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users

Association of a group's interests with its users

The engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

$l$: the number of post likes
$c_s$: the number of short comments
$c_l$: the number of long comments
$r$: the number of reposts

The association score of a user and an interest depends on the engagement in a group and on the number of groups:

$$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

$e_{vk}$, $e_{fb}$: engagement in a VK/FB interest
$g_{vk}$, $g_{fb}$: the number of groups a user has in VK/FB
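The two formulas above translate directly into code. In the sketch below the weight values and the defaults for α and β are placeholders chosen for illustration; the slides do not state the actual values.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    likes: int           # l
    short_comments: int  # c_s
    long_comments: int   # c_l
    reposts: int         # r


# Hypothetical weights: the talk does not give the real values.
W_LIKE, W_SCOMM, W_LCOMM, W_REPOST = 1.0, 2.0, 3.0, 4.0


def engagement(a):
    """e ~ w_like*l + w_scomm*c_s + w_lcomm*c_l + w_repost*r"""
    return (W_LIKE * a.likes + W_SCOMM * a.short_comments
            + W_LCOMM * a.long_comments + W_REPOST * a.reposts)


def association(e_fb, g_fb, e_vk, g_vk, alpha=0.5, beta=0.5):
    """all ~ alpha*e_fb*g_fb + beta*e_vk*g_vk (alpha, beta are assumed here)."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk


e_fb = engagement(Activity(likes=3, short_comments=2, long_comments=1, reposts=1))
score = association(e_fb, g_fb=2, e_vk=0.0, g_vk=0)
```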

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov.

Motivation

Input: a user profile in one social network
Output: the profile of the same person in another social network
Immediate applications in marketing, search, security, etc.

Contribution

A user identity resolution approach
Precision of 0.98 and recall of 0.54
The method is computationally efficient and easily parallelizable

Dataset

                                  VK            Facebook
Number of users in our dataset    89,561,085    2,903,144
Number of users in Russia [1]     100,000,000   13,000,000
User overlap                      29%           88%

Training set: 92,488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Profile matching algorithm

1. Candidate generation. For each VK profile, we retrieve a set of FB profiles with similar first and second names.
2. Candidate ranking. The candidates are ranked according to the similarity of their friends.
3. Selection of the best candidate. The final step selects the best match from the list of candidates.

Candidate generation

Retrieve FB users with names similar to the input VK profile
Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
Levenshtein automata for fuzzy matching between a VK user name and all FB user names
Automatically extracted dictionary of name synonyms: "Alexander", "Sasha", "Sanya", "Sanek", etc.
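The similarity rule for candidate generation can be sketched as follows. Note that this uses a plain dynamic-programming edit distance with a linear scan instead of Levenshtein automata, so it shows the matching criterion only, not the efficient index. The synonym dictionary is a tiny hand-written stand-in for the automatically extracted one.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance <= max_dist."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_dist


# Hypothetical synonym dictionary (the real one was extracted automatically).
SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(name):
    return SYNONYMS.get(name.lower(), name.lower())


def generate_candidates(vk_first, vk_last, fb_profiles):
    """fb_profiles: iterable of (first_name, last_name) tuples."""
    first = canonical(vk_first)
    return [p for p in fb_profiles
            if names_similar(first, canonical(p[0]))
            and names_similar(vk_last, p[1])]


fb = [("Alexander", "Panchenko"), ("Alexandra", "Pancenko"), ("Oleg", "Panchenko")]
candidates = generate_candidates("Sasha", "Panchenko", fb)
```

A linear scan like this is quadratic over two large networks, which is why an automaton or another index over the FB names is needed at the scale of the dataset.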

Candidate ranking

The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles.
Two friends are considered similar if:
  the first two letters of their last names match;
  the similarities $\mathrm{sim}_s$ between their first and last names are greater than the thresholds α and β, respectively:

$$\mathrm{sim}_s(s_i, s_j) = 1 - \frac{\mathrm{lev}(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

The contribution of each friend to the similarity $\mathrm{sim}_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected frequency of the name:

$$\mathrm{sim}_p(p_{vk}, p_{fb}) = \sum_{j \,:\, \mathrm{sim}_s(s_i^f, s_j^f) > \alpha \,\wedge\, \mathrm{sim}_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)$$

Here $s_i^f$ and $s_i^s$ are the first and second names of a VK profile, respectively, while $s_j^f$ and $s_j^s$ refer to an FB profile.
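A minimal sketch of the ranking score follows. It rests on an interpretation the slide does not spell out: $|s_j^f|$ and $|s_j^s|$ are read as the corpus frequencies of the friend's first and last names, and $N$ as the total number of profiles, so that rare names contribute more. The frequency tables and sample data are made up.

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a, b):
    """sim_s = 1 - lev(a, b) / max(|a|, |b|)"""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def friends_similar(vk, fb, alpha, beta):
    """vk, fb: (first_name, last_name) tuples, already lowercased."""
    return (vk[1][:2] == fb[1][:2]
            and sim_s(vk[0], fb[0]) > alpha
            and sim_s(vk[1], fb[1]) > beta)


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n_profiles,
          alpha=0.8, beta=0.6):
    """Sum over matched FB friends of min(1, N / (freq(first) * freq(last)))."""
    score = 0.0
    for fb in fb_friends:
        if any(friends_similar(vk, fb, alpha, beta) for vk in vk_friends):
            # Assumption: unseen names get frequency 1 (maximally rare).
            expected = first_freq.get(fb[0], 1) * last_freq.get(fb[1], 1)
            score += min(1.0, n_profiles / expected)
    return score


vk_friends = [("ivan", "petrov"), ("olga", "sidorova")]
fb_friends = [("ivan", "petrov"), ("anna", "kuznetsova")]
score = sim_p(vk_friends, fb_friends,
              first_freq={"ivan": 1000}, last_freq={"petrov": 500},
              n_profiles=100000)
```

With these toy frequencies the single matched friend, a very common name, adds only 0.2 to the score, while a friend with a rare name would add the maximum of 1.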

Best candidate selection

FB candidates are ranked according to their similarity $\mathrm{sim}_p$ to the input profile $p_{vk}$.
The best candidate $p_{fb}$ must pass two thresholds to be accepted as a match:
  its score must be higher than the score threshold γ:

$$\mathrm{sim}_p(p_{vk}, p_{fb}) > \gamma$$

  it must be either the only candidate, or the ratio between its score and the score of the next best candidate $p'_{fb}$ must be higher than the ratio threshold δ:

$$\frac{\mathrm{sim}_p(p_{vk}, p_{fb})}{\mathrm{sim}_p(p_{vk}, p'_{fb})} > \delta$$
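The two-threshold rule can be sketched as a small function. The defaults γ = 3 and δ = 5 are the values reported on the results slide; treating a runner-up with a zero score as passing the ratio test is an assumption of this sketch.

```python
def select_best(scored, gamma=3.0, delta=5.0):
    """scored: list of (fb_profile_id, sim_p score). Returns an id or None."""
    if not scored:
        return None
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score <= gamma:
        return None  # fails the score threshold
    if len(ranked) == 1:
        return best_id  # the only candidate
    runner_up = ranked[1][1]
    # Assumption: a zero-score runner-up counts as an infinite ratio.
    if runner_up <= 0 or best_score / runner_up > delta:
        return best_id
    return None  # too close to the next best candidate
```

Returning `None` when either test fails trades recall for precision: ambiguous profiles are left unmatched rather than matched wrongly.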

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α        0.8
Second name threshold β       0.6
Profile score threshold γ     3
Profile ratio threshold δ     5
Number of matched profiles    644,334 (22%)
Expected precision            0.98
Expected recall               0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection
  "Tell me who your friends are, and I will tell you who you are"
  Most frequent age/region of friends
  Reject users with a high variation of age/region among friends
  Up to 85-90% precision

User Income Detection
  Transfer learning: the target variable is not present in SNs
  Train a model on a set of users with known income
  Apply the model to the social network profiles
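The friend-based age/region heuristic can be sketched as a majority vote with a rejection rule. The `min_share` cut-off is a hypothetical stand-in for the "high variation" test; the talk does not say how variation was measured.

```python
from collections import Counter


def infer_from_friends(friend_values, min_share=0.5):
    """Return the most frequent value among friends, or None if rejected.

    friend_values: region (or age) of each friend, None when unknown.
    min_share: assumed threshold; reject when the top value is too rare.
    """
    known = [v for v in friend_values if v is not None]
    if not known:
        return None
    value, count = Counter(known).most_common(1)[0]
    if count / len(known) < min_share:
        return None  # friends disagree too much
    return value
```

For example, a user whose friends are mostly from one city is assigned that city, while a user with friends spread evenly over many cities is left unlabeled.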

Thank you! Questions?



Training Data

Figure: Name-surname co-occurrences; rows and columns are sorted by frequency.

Character endings of Russian names

72% of first names have a typical male/female ending
68% of surnames have a typical male/female ending
A typical male/female ending splits males from females with an error of less than 5%
The gender of ≥ 50% of first names is recognized with 8 endings
The gender of ≥ 50% of second names is recognized with 5 endings

Conclusion

A simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach.

Character endings of Russian names

Type          Ending     Gender   Error, %   Example
first name    na (на)    female   0.27       Ekaterina
first name    iya (ия)   female   0.32       Anastasiya
first name    ei (ей)    male     0.16       Sergei
first name    dr (др)    male     0.00       Alexandr
first name    ga (га)    male     4.94       Serega
first name    an (ан)    male     4.99       Ivan
first name    la (ла)    female   4.23       Luidmila
first name    ii (ий)    male     0.34       Yurii
second name   va (ва)    female   0.28       Morozova
second name   ov (ов)    male     0.21       Objedkov
second name   na (на)    female   2.22       Matyushina
second name   ev (ев)    male     0.44       Sergeev
second name   in (ин)    male     1.94       Teterin

Table: Most discriminative and frequent two-character endings of Russian names.

Gender Detection Method

Input: a string representing the name of a person
Output: gender (male or female)
A binary classification task

Features

Endings
Character n-grams
Dictionary of male/female names and surnames

Model

L2-regularized Logistic Regression
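The ending and character n-gram features can be sketched as below. The ending list is copied from the table of Russian name endings in transliterated form; the `^`/`$` boundary markers and the feature naming scheme are assumptions of this sketch.

```python
# Transliterated endings from the table of Russian name endings.
ENDINGS = ("na", "iya", "ei", "dr", "ga", "an", "la", "ii",
           "va", "ov", "ev", "in")


def char_ngrams(token, n=3):
    """Character n-grams with assumed start/end markers."""
    padded = "^" + token.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def extract_features(full_name):
    """Binary features for a 'First Last' string."""
    feats = {}
    for token in full_name.split():
        t = token.lower()
        for ending in ENDINGS:
            if t.endswith(ending):
                feats["end=" + ending] = 1.0
        for gram in char_ngrams(t):
            feats["3g=" + gram] = 1.0
    return feats


features = extract_features("Ekaterina Morozova")
```

A feature dictionary of this shape could be vectorized, for example with scikit-learn's `DictVectorizer`, and passed to an L2-regularized `LogisticRegression`.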

Features

Character n-grams

Males: Alexander Yaroskavski, Oleg Arbuzov
Females: Alexandra Yaroskavskaya, Nayaliya Arbuzova
BUT: "Sidorenko", "Moroz", or "Bondar"
The two most common one-character endings: "a" and "ya" ("я")

Dictionaries of first and last names

The probability that a name belongs to the male gender: P(c = male | firstname), P(c = male | lastname)
3,427 first names, 11,411 last names
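The dictionary features can be sketched as relative-frequency estimates from gender-labeled names. The 0.5 fallback for unseen names and the toy data are assumptions; the slides only state that the dictionaries hold P(c = male | name).

```python
from collections import defaultdict


def build_gender_dict(labeled_names):
    """labeled_names: iterable of (name, 'male' | 'female') pairs."""
    counts = defaultdict(lambda: [0, 0])  # name -> [male count, total count]
    for name, gender in labeled_names:
        entry = counts[name.lower()]
        entry[1] += 1
        if gender == "male":
            entry[0] += 1
    return {name: male / total for name, (male, total) in counts.items()}


def p_male(name, gender_dict, default=0.5):
    """P(c = male | name); unseen names fall back to an assumed prior."""
    return gender_dict.get(name.lower(), default)


first_names = build_gender_dict([
    ("Sasha", "male"), ("Sasha", "female"), ("Sasha", "male"), ("Oleg", "male"),
])
```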

Results

Model                   Accuracy        Precision       Recall          F-measure
rule-based baseline     0.638           0.995           0.633           0.774
endings                 0.850 ± 0.002   0.921 ± 0.003   0.784 ± 0.004   0.847 ± 0.002
3-grams                 0.944 ± 0.003   0.948 ± 0.003   0.946 ± 0.003   0.947 ± 0.003
dicts                   0.956 ± 0.002   0.992 ± 0.001   0.925 ± 0.003   0.957 ± 0.002
endings+3-grams         0.946 ± 0.003   0.950 ± 0.002   0.947 ± 0.004   0.949 ± 0.003
3-grams+dicts           0.956 ± 0.003   0.960 ± 0.003   0.957 ± 0.004   0.959 ± 0.003
endings+3-grams+dicts   0.957 ± 0.003   0.961 ± 0.003   0.959 ± 0.004   0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here, endings: 4 Russian female endings; trigrams: the 1,000 most frequent 3-grams; dictionary: the name/surname dictionary. The table presents precision, recall, and F-measure of the female class.
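As a sanity check on the table, and assuming the F-measure is the balanced F1 score (the slide does not name the variant), the rule-based baseline row can be reproduced from its precision and recall:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2 \cdot 0.995 \cdot 0.633}{0.995 + 0.633}
    = \frac{1.25967}{1.628}
    \approx 0.774
```

The baseline is very precise but misses over a third of the female names, which is what pulls its F-measure well below that of the statistical models.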

Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.

4 User Language Detection

Problem

Motivation

Goal: to detect Russian-speaking users
The Cyrillic alphabet is also used by Ukrainian, Belorussian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

Which method is the best for the Russian language?
How to adapt it to the FB profile?

Contributions

A comparison of Russian-enabled language detection modules
A technique for the identification of Russian-speaking users

Method

Input: a FB user profile
Output: is Russian-speaker (or a set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка ать ото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те ели ере кот льн ник нти о с
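Character trigram statistics of this kind can be computed with a few lines of code. The sketch below is a toy overlap score against a reference trigram set; it is not the algorithm of any of the language identification modules discussed in the talk, and the function names are illustrative.

```python
from collections import Counter


def char_trigrams(text):
    """All character trigrams, spaces included, after whitespace cleanup."""
    text = " ".join(text.lower().split())
    return [text[i:i + 3] for i in range(len(text) - 2)]


def trigram_profile(text, top_n=300):
    """The most frequent trigrams of a reference corpus."""
    return [gram for gram, _ in Counter(char_trigrams(text)).most_common(top_n)]


def overlap_score(text, reference_trigrams):
    """Share of the text's trigrams that occur in the reference profile."""
    grams = char_trigrams(text)
    if not grams:
        return 0.0
    reference = set(reference_trigrams)
    return sum(gram in reference for gram in grams) / len(grams)
```

A profile built from a Russian corpus would score Russian text higher than Ukrainian or Bulgarian text, although closely related languages share many trigrams and need a stronger model to separate reliably.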

Existing modules for language identification

langid.py: https://github.com/saffsd/langid.py
  Advanced n-gram selection
chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector
guess-language: https://code.google.com/p/guess-language
Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language
  20$ / 1M characters
Yandex Translate API: http://api.yandex.ru/translate
  Free of charge, 1M characters / day (as of September 2013)
Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

Profile text: posts + comments + user names − Latin symbols
Profile text length: 3367 ± 17540
Russian-speakers: P(ru) > 0.95
Core Russian-speakers:
  P(ru) > 0.95
  Cyrillic symbols >= 20
  locale is ru_RU
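The selection rules above can be sketched as two predicates. `p_ru` is assumed to come from an external language detector; reading "Cyrillic symbols >= 20" as an absolute character count and using the Unicode Cyrillic block U+0400 to U+04FF are both assumptions.

```python
def count_cyrillic(text):
    """Number of characters in the basic Unicode Cyrillic block."""
    return sum("\u0400" <= ch <= "\u04ff" for ch in text)


def is_russian_speaker(p_ru, threshold=0.95):
    """p_ru: probability of Russian returned by a language detector."""
    return p_ru > threshold


def is_core_russian_speaker(p_ru, profile_text, locale,
                            threshold=0.95, min_cyrillic=20):
    return (p_ru > threshold
            and count_cyrillic(profile_text) >= min_cyrillic
            and locale == "ru_RU")
```

The stricter "core" definition combines three independent signals, so a profile with a confident detector score but almost no Cyrillic text, or a non-Russian locale, is excluded.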

Facebook Dataset: Results

9,906,524 public FB profiles (>= 50 Cyrillic characters)
8,687,915 (88%) Russian-speaking users
3,190,813 (32%) core Russian-speaking public Facebook users
5,365,691 (54%) of profiles with no profile text (<= 200 characters)

  • Social Network Data
  • Social Network Analysis
  • User Gender Detection
  • User Language Detection
  • User Interests Detection
  • VK-FB User Matching
  • Other SNA Tasks
Page 16: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Character endings of Russian names

72 of first names have typical malefemale ending68 of surnames have typical malefemale endinga typical malefemale ending splits males from females with anerror less than 5gender of ge 50 first names recognized with 8 endingsgender of ge 50 second names recognized with 5 endings

Conclusion

Simple symbolic ending-based method cannot robustly classifyabout 30 of names This motivates the need for a moresophisticated statistical approach

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Character endings of Russian names

Type Ending Gender Error Examplefirst name na (на) female 027 Ekaterinafirst name iya (ия) female 032 Anastasiyafirst name ei (ей) male 016 Sergeifirst name dr (др) male 000 Alexandrfirst name ga (га) male 494 Seregafirst name an (ан) male 499 Ivanfirst name la (ла) female 423 Luidmilafirst name ii (ий) male 034 Yuriisecond name va (ва) female 028 Morozovasecond name ov (ов) male 021 Objedkovsecond name na (на) female 222 Matyushinasecond name ev (ев) male 044 Sergeevsecond name in (ин) male 194 Teterin

Table Most discriminative and frequent two character endings ofRussian names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Gender Detection Method

input a string representing a name of a personoutput gender (male or female)binary classification task

Features

endingscharacter n-gramsdictionary of malefemale names and surnames

Model

L2-regularized Logistic Regression

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Features

Character n-grams

males Alexander Yaroskavski Oleg Arbuzovfemales Alexandra Yaroskavskaya Nayaliya ArbuzovaBUT ldquoSidorenkordquo ldquoMorozrdquo or ldquoBondarrdquotwo most common one-character endings ldquoardquo and ldquoyardquo (ldquoяrdquo)

Dictionaries of first and last names

probability that it belongs to the male genderP(c = male|firstname) P(c = male|lastname)3427 first names 11411 last names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002

Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Motivation

Goal: to detect Russian-speaking users.
The Cyrillic alphabet is also used by Ukrainian, Belorussian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

Which method is the best for the Russian language?
How to adapt it to the FB profile?

Contributions

Comparison of Russian-enabled language detection modules.
A technique for identification of Russian-speaking users.

Method

Input: a FB user profile.
Output: whether the user is a Russian speaker (or the set of languages the user speaks).

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка ать ото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те ели ере кот льн ник нти о с
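A minimal sketch of how a character-trigram profile can score a text, assuming a plain overlap measure rather than the scoring used by any of the modules compared in this talk. `RU_TRIGRAMS` is a small hand-picked subset of the trigrams listed above, and `trigram_score` is a hypothetical helper name.

```python
from collections import Counter

# Illustrative subset of frequent Russian trigrams (assumption: a real
# profile would hold many more trigrams, usually with frequencies).
RU_TRIGRAMS = {"про", "ать", "ото", "ова", "тел", "тор", "сти", "стр",
               "ват", "иче", "сто", "вер", "ист", "ост", "тра", "ели",
               "ере", "кот", "льн", "ник", "нти"}


def char_trigrams(text):
    """Return a Counter of lowercase character trigrams in text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def trigram_score(text, profile=RU_TRIGRAMS):
    """Share of the text's trigrams that occur in the language profile."""
    grams = char_trigrams(text)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    hits = sum(n for g, n in grams.items() if g in profile)
    return hits / total


russian = "который стоит просто"
english = "the quick brown fox"
print(trigram_score(russian) > trigram_score(english))
```

A profile-based detector would compute such a score against one profile per language and pick the best one; the off-the-shelf modules on the next slide do this with far larger models.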


Existing modules for language identification

langid.py
https://github.com/saffsd/langid.py
Advanced n-gram selection

chromium compact language detector (cld)
https://code.google.com/p/chromium-compact-language-detector

guess-language
https://code.google.com/p/guess-language

Google Translate API
https://developers.google.com/translate/v2/using_rest#detect-language
$20 per 1M characters

Yandex Translate API
http://api.yandex.ru/translate
Free of charge, 1M characters per day (as of September 2013)

Many more, e.g. language-detection for Java

DBpedia Dataset


Accuracy of Different Language Detection Modules


Facebook Dataset: Method

Profile text: posts + comments + user names - Latin symbols.
Profile text length: 3367 ± 17540.
Russian-speakers: P(ru) > 0.95.
Core Russian-speakers: P(ru) > 0.95, Cyrillic symbols >= 20, locale is ru_RU.
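The rules above can be sketched as a small decision function. This is an illustration under stated assumptions: `p_ru` is the probability returned by some language detector, the three "core" conditions are combined with a logical AND, and the threshold of 20 is treated as a count of Cyrillic characters (the slide does not say whether it is a count or a percentage). `classify_profile` and `count_cyrillic` are hypothetical names.

```python
def count_cyrillic(text):
    """Count characters in the basic Unicode Cyrillic block (U+0400..U+04FF)."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")


def classify_profile(p_ru, text, locale, p_min=0.95, min_cyrillic=20):
    """Label a profile as 'core', 'russian-speaker' or 'other'.

    p_ru         -- P(ru) from a language detector (assumed to be given)
    min_cyrillic -- assumed to be a character count, not a percentage
    """
    if p_ru <= p_min:
        return "other"
    if count_cyrillic(text) >= min_cyrillic and locale == "ru_RU":
        return "core"
    return "russian-speaker"


print(classify_profile(0.99, "привет " * 5, "ru_RU"))
```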


Facebook Dataset: Results

9,906,524 public FB profiles (>= 50 Cyrillic characters).
8,687,915 (88%) Russian-speaking users.
3,190,813 (32%) core Russian-speaking public Facebook users.
5,365,691 (54%) of profiles with no profile text (<= 200 characters).

5 User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Input: some SN data representing a user.
Output: list of user interests.

Motivation

Advertisement targeting, user segmentation, etc.
Recommendations of content and friends.
Customization of user experience.

Data: FB and VK groups

Text corpus

41 million of VK groups
11 million of FB publics
15 million of FB groups

Data format

Title and/or description
List of members
Number of comments, likes, posts by a member

Data: VK groups

253 interests detected by our system


Method

1. Create a text index of groups.
2. Create a keyword list for each of the 253 interests.
3. KW classifier:
   Retrieve the top k groups matched by a set of interest keywords.
   Rank by TF-IDF.
   Associate the group's interests with its users.
   A group may have multiple interests.
4. ML classifier:
   Use the top k groups as training data.
   BOW features.
   Keyword features.
   Linear models: L2 LR, linear SVM, NB.
   Classify all groups.
   A group may have up to three top interests.
   Associate the group's interests with its users.
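The keyword step (retrieve the top k groups for an interest and rank them by TF-IDF) can be sketched as follows. This is a toy in-memory version under stated assumptions: whitespace tokenization, the textbook tf * log(N/df) weighting, and a hypothetical `rank_groups` helper; the talk does not specify the index or the exact TF-IDF variant.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (assumption: real data needs more)."""
    return text.lower().split()


def rank_groups(groups, keywords, k=2):
    """Return ids of the top-k groups by summed TF-IDF of the keywords.

    groups   -- dict: group id -> title/description text
    keywords -- keyword list of one interest
    """
    docs = {gid: Counter(tokenize(text)) for gid, text in groups.items()}
    n_docs = len(docs)
    df = Counter()                      # document frequency of each token
    for counts in docs.values():
        df.update(counts.keys())
    scores = {}
    for gid, counts in docs.items():
        length = sum(counts.values())
        score = 0.0
        for kw in keywords:
            if counts[kw]:
                tf = counts[kw] / length
                idf = math.log(n_docs / df[kw])
                score += tf * idf
        if score > 0:
            scores[gid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


groups = {
    "g1": "football club fans football",
    "g2": "cooking recipes and food",
    "g3": "football and cooking",
}
print(rank_groups(groups, ["football", "soccer"]))
```

The groups returned for each interest would then serve as training data for the ML classifier in step 4.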


Association of a group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

$l$: the number of post likes
$c_s$: the number of short comments
$c_l$: the number of long comments
$r$: the number of reposts

The association score of a user and an interest depends on the engagement in a group and on the number of groups:

$$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

$e_{vk}$, $e_{fb}$: engagement in the VK/FB interest
$g_{vk}$, $g_{fb}$: number of groups a user has in VK/FB
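The two formulas translate directly into code. The weights and the α, β values below are placeholders chosen for illustration only; the talk does not report the values that were used.

```python
def engagement(likes, short_comments, long_comments, reposts,
               w_like=1.0, w_scomm=2.0, w_lcomm=3.0, w_repost=4.0):
    """Weighted activity of a user in groups of one interest category.

    All four weights are hypothetical defaults, not values from the talk.
    """
    return (w_like * likes + w_scomm * short_comments
            + w_lcomm * long_comments + w_repost * reposts)


def association(e_fb, g_fb, e_vk, g_vk, alpha=0.5, beta=0.5):
    """User-interest association score combining FB and VK evidence."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk


e_fb = engagement(likes=10, short_comments=2, long_comments=1, reposts=0)
print(e_fb)                                        # 10 + 4 + 3 = 17.0
print(association(e_fb, g_fb=2, e_vk=5.0, g_vk=3))  # 0.5*34 + 0.5*15 = 24.5
```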


Results

Results per category: the best and the worst

Top 30 interests on FB and VK


Intersection of the top 30 interests on FB and VK


Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

Input: a user profile in one social network.
Output: the profile of the same person in another social network.
Immediate applications in marketing, search, security, etc.

Contribution

A user identity resolution approach.
Precision of 0.98 and recall of 0.54.
The method is computationally efficient and easily parallelizable.

Dataset

                                   VK             Facebook
Number of users in our dataset     89,561,085     2,903,144
Number of users in Russia [1]      100,000,000    13,000,000
User overlap                       29%            88%

Training set: 92,488 matched FB-VK profiles.

[1] According to comScore and http://vk.com/about

Profile matching algorithm

1. Candidate generation. For each VK profile, we retrieve a set of FB profiles with similar first and second names.

2. Candidate ranking. The candidates are ranked according to the similarity of their friends.

3. Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

Retrieve FB users with names similar to the input VK profile.
Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2.
Levenshtein automata for fuzzy matching between a VK user name and all FB user names.
Automatically extracted dictionary of name synonyms:

“Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
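The name-similarity rule can be illustrated with a plain dynamic-programming edit distance. Note the substitution: the talk uses Levenshtein automata to match one name against all FB names efficiently, whereas this sketch compares a single pair directly. `NAME_SYNONYMS` is a tiny hypothetical stand-in for the automatically extracted synonym dictionary.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Hypothetical synonym dictionary: diminutive -> canonical first name.
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander",
                 "sanek": "alexander"}


def canonical(name):
    """Lowercase a name and map known diminutives to the canonical form."""
    name = name.lower()
    return NAME_SYNONYMS.get(name, name)


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance <= max_dist, after synonym mapping."""
    a, b = canonical(a), canonical(b)
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_dist


print(names_similar("Sasha", "Alexander"))   # synonyms map to the same name
print(names_similar("Ivanov", "Ivanova"))    # one insertion apart
print(names_similar("Ivan", "Oleg"))         # different first letter
```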


Candidate ranking

The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles.
Two friends are considered to be similar if:

the first two letters of their last names match;
the similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$:

$$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the name expectation frequency:

$$sim_p(p_{vk}, p_{fb}) = \sum_{j:\; sim_s(s_i^f, s_j^f) > \alpha \,\wedge\, sim_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)$$

Here $s_i^f$ and $s_i^s$ are the first and second names of a VK profile, respectively, while $s_j^f$ and $s_j^s$ refer to a FB profile.
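A sketch of the ranking score under explicit assumptions, since the slide leaves several symbols undefined: $|s_j^f|$ and $|s_j^s|$ are read here as corpus frequencies of the FB friend's first and last names, $N$ as a normalisation constant, and each FB friend is counted at most once. The function names and the frequency dictionaries are hypothetical; the default thresholds 0.8 and 0.6 are the α and β values reported in the results.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n,
          alpha=0.8, beta=0.6):
    """Profile similarity from overlapping friends.

    vk_friends, fb_friends -- lists of (first_name, last_name) tuples
    first_freq, last_freq  -- assumed name -> frequency dictionaries
    n                      -- assumed normalisation constant N
    """
    total = 0.0
    for fb_first, fb_last in fb_friends:
        for vk_first, vk_last in vk_friends:
            if (vk_last[:2] == fb_last[:2]
                    and sim_s(vk_first, fb_first) > alpha
                    and sim_s(vk_last, fb_last) > beta):
                freq = first_freq.get(fb_first, 1) * last_freq.get(fb_last, 1)
                total += min(1.0, n / freq)   # rare names contribute more
                break                          # count each FB friend once
    return total


vk = [("ivan", "petrov"), ("olga", "sidorova")]
fb = [("ivan", "petrov"), ("anna", "kuznetsova")]
print(sim_p(vk, fb, {"ivan": 100}, {"petrov": 50}, n=1000))
```

Capping each contribution at 1 keeps very rare names from dominating the score, while frequent names such as a common first and last name combination add little evidence.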


Best candidate selection

FB candidates are ranked according to their similarity $sim_p$ to an input profile $p_{vk}$.

The best candidate $p_{fb}$ should pass two thresholds to match:

its score should be higher than the score threshold $\gamma$:

$$sim_p(p_{vk}, p_{fb}) > \gamma$$

it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$$
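The two-threshold selection rule can be sketched as below. The defaults γ = 3 and δ = 5 are the values reported in the results; `select_best` and the input format (a dictionary of candidate scores) are assumptions made for illustration, as is the handling of a zero runner-up score.

```python
def select_best(candidates, gamma=3.0, delta=5.0):
    """Pick the matching FB profile or return None.

    candidates -- dict: FB profile id -> sim_p score against the VK profile
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    best_id, best_score = ranked[0]
    if best_score <= gamma:              # score threshold
        return None
    if len(ranked) == 1:                 # the only candidate
        return best_id
    runner_up = ranked[1][1]
    # Ratio threshold; a zero runner-up is treated as an unambiguous match.
    if runner_up == 0 or best_score / runner_up > delta:
        return best_id
    return None


print(select_best({"fb_1": 10.0, "fb_2": 1.0}))   # ratio 10 > 5
print(select_best({"fb_1": 10.0, "fb_2": 4.0}))   # ratio 2.5, ambiguous
```

Raising γ or δ trades recall for precision, which is the trade-off shown in the precision-recall plot.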


Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α: 0.8
Second name threshold β: 0.6
Profile score threshold γ: 3
Profile ratio threshold δ: 5
Number of matched profiles: 644,334 (22%)
Expected precision: 0.98
Expected recall: 0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection
“Tell me who your friends are, and I will tell you who you are.”
Most frequent age/region of friends.
Reject users with high variation of age/region among friends.
Up to 85-90% precision.

User Income Detection
Transfer learning: the target variable is not present in SNs.
Training a model on a set of users with known income.
Applying the model to the social network profiles.
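The friend-based age heuristic can be sketched in a few lines. The rejection threshold `max_std` and the minimum number of friends are hypothetical parameters; the talk only states that users with high variation among friends are rejected.

```python
from collections import Counter
from statistics import pstdev


def infer_age(friend_ages, max_std=5.0, min_friends=3):
    """Return the most frequent age among friends, or None if unreliable.

    max_std and min_friends are illustrative thresholds, not values
    from the talk.
    """
    if len(friend_ages) < min_friends:
        return None
    if pstdev(friend_ages) > max_std:    # friends' ages vary too much
        return None
    return Counter(friend_ages).most_common(1)[0][0]


print(infer_age([25, 25, 26, 24, 25]))   # homogeneous friends
print(infer_age([18, 45, 60, 25]))       # rejected: high variation
```

The same pattern applies to region detection, with the variation measured as the share of friends outside the most frequent region instead of a standard deviation.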

Thank you! Questions?


Character endings of Russian names

Type          Ending      Gender   Error   Example
first name    na (на)     female   0.27    Ekaterina
first name    iya (ия)    female   0.32    Anastasiya
first name    ei (ей)     male     0.16    Sergei
first name    dr (др)     male     0.00    Alexandr
first name    ga (га)     male     4.94    Serega
first name    an (ан)     male     4.99    Ivan
first name    la (ла)     female   4.23    Luidmila
first name    ii (ий)     male     0.34    Yurii
second name   va (ва)     female   0.28    Morozova
second name   ov (ов)     male     0.21    Objedkov
second name   na (на)     female   2.22    Matyushina
second name   ev (ев)     male     0.44    Sergeev
second name   in (ин)     male     1.94    Teterin

Table: Most discriminative and frequent two-character endings of Russian names.

Gender Detection Method

Input: a string representing the name of a person.
Output: gender (male or female).
Binary classification task.

Features

endings
character n-grams
dictionary of male/female names and surnames

Model

L2-regularized Logistic Regression

Features

Character n-grams

males: Alexander Yaroskavski, Oleg Arbuzov
females: Alexandra Yaroskavskaya, Nataliya Arbuzova
BUT: “Sidorenko”, “Moroz” or “Bondar”
two most common one-character endings: “a” and “ya” (“я”)

Dictionaries of first and last names

probability that a name belongs to the male gender: P(c = male | firstname), P(c = male | lastname)
3,427 first names, 11,411 last names
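The three feature groups can be combined into one feature dictionary per name, which could then be fed to an L2-regularized logistic regression. This is a sketch under stated assumptions: the four default endings are the female endings from the endings table (the talk does not list which four were used), all trigrams are kept rather than the 1000 most frequent, and unknown names get a neutral dictionary probability of 0.5. `name_features` is a hypothetical helper.

```python
def name_features(first, last, first_dict, last_dict,
                  endings=("на", "ия", "ва", "ла")):
    """Build ending, character-trigram and dictionary features for a name.

    first_dict, last_dict -- name -> P(c = male | name); 0.5 if unknown
    endings               -- assumed set of female endings
    """
    first, last = first.lower(), last.lower()
    feats = {}
    # Ending features, separately for first and last name.
    for e in endings:
        feats["first_ends_" + e] = float(first.endswith(e))
        feats["last_ends_" + e] = float(last.endswith(e))
    # Binary character trigram features over the padded full name.
    padded = "^" + first + " " + last + "$"
    for i in range(len(padded) - 2):
        feats["3g_" + padded[i:i + 3]] = 1.0
    # Dictionary features.
    feats["p_male_first"] = first_dict.get(first, 0.5)
    feats["p_male_last"] = last_dict.get(last, 0.5)
    return feats


feats = name_features("Екатерина", "Морозова", {"екатерина": 0.01}, {})
print(feats["first_ends_на"], feats["last_ends_ва"], feats["p_male_first"])
```

Surnames such as “Sidorenko” or “Bondar” trigger none of the ending features, which is where the first-name dictionary feature carries the decision.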

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Model Accuracy Precision Recall F-measurerule-based baseline 0638 0995 0633 0774endings 0850 plusmn 0002 0921 plusmn 0003 0784 plusmn 0004 0847 plusmn 00023-grams 0944 plusmn 0003 0948 plusmn 0003 0946 plusmn 0003 0947 plusmn 0003dicts 0956 plusmn 0002 0992 plusmn 0001 0925 plusmn 0003 0957 plusmn 0002endings+3-grams 0946 plusmn 0003 0950 plusmn 0002 0947 plusmn 0004 0949 plusmn 00033-grams+dicts 0956 plusmn 0003 0960 plusmn 0003 0957 plusmn 0004 0959 plusmn 0003endings+3-grams+dicts 0957 plusmn 0003 0961 plusmn 0003 0959 plusmn 0004 0960 plusmn 0002

Table Results of the experiments on the training set of 10000 namesHere endings ndash 4 Russian female endings trigrams ndash 1000 most frequent3-grams dictionary ndash namesurname dict This table presents precisionrecall and F-measure of the female class

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Figure Learning curves of single and combined models Accuracy wasestimated on separate sample of 10000 names

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Motivation

Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc

Research Questions

Which method is the best for Russian languageHow to adopt it to the FB profile

Contributions

Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Existing modules for language identification

langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection

chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector

guess-languagehttpscodegooglecompguess-language

Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters

Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)

Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

DBpedia Dataset

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Accuracy of Different Language Detection Modules

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Method

Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers

P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Results

9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

input some SN data representing a useroutput list of user interests

Motivation

Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data FB and VK groups

Text corpus

41 million of VK groups11 million of FB publics15 million of FB groups

Data format

Title andor descriptionList of membersNumber of comments likes posts by a member

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data VK groups

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

253 interests detected by our system

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier

Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests

4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Association of grouprsquos interests with its users

Engagement of a person into an interest category isproportional to the activity of the person in groups of this category

e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r

l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts

Association score of a user and an interest depends onengagement in a group and on the number of groups

all asymp α middot efb middot gfb + β middot evk middot gvk

evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results per category the best and the worst

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Intersection of the top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Interests co-occurrences

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Segei Objedkov

Motivation

input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc

Contribution

user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Dataset

VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88

training set 92488 matched FB-VK profiles

1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Profile matching algorithm

1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names

2 Candidate ranking The candidates are ranked according tosimilarity of their friends

3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate generation

Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms

ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate ranking

The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if

First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β

sims(si sj) = 1minus lev(si sj)max(|si | |sj |)

Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency

simp(pvk pfb) =sum

j sims(sfi s

fj )gtαandsims(ss

i ssj )gtβ

min(1N

|s fj | middot |ss

j |)

Here s fi and ss

i are first and second names of a VK profilecorrespondingly while s f

j and ssj refer to a FB profile

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Best candidate selection

FB candidates are ranked according to similarity simp to aninput profile pvk

The best candidate pfb should pass two thresholds to match

its score should be higher than the score threshold γ

simp(pvk pfb) gt γ

either the only candidate or score ratio between it and the nextbest candidate pprime

fb should be higher than the ratio threshold δ

simp(pvk pfb)

simp(pvk pprimefb)

gt δ

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results matching VK and FB profiles

First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Much more fun stuff can be done with the FBVK data

User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision

User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Thank you Questions

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM

Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics

Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21

Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media

Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412

Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop

Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics

Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM

Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM

Alexander Panchenko Text Analysis of Social Networks

  • Social Network Data
  • Social Network Analysis
  • User Gender Detection
  • User Language Detection
  • User Interests Detection
  • VK-FB User Matching
  • Other SNA Tasks
Page 18: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Gender Detection Method

input a string representing a name of a personoutput gender (male or female)binary classification task

Features

endingscharacter n-gramsdictionary of malefemale names and surnames

Model

L2-regularized Logistic Regression

Features

Character n-grams

  • males: Alexander Yaroskavski, Oleg Arbuzov
  • females: Alexandra Yaroskavskaya, Nataliya Arbuzova
  • BUT: "Sidorenko", "Moroz" or "Bondar"
  • two most common one-character endings: "a" and "ya" ("я")

Dictionaries of first and last names

  • probability that a name belongs to the male gender: P(c = male | firstname), P(c = male | lastname)
  • 3427 first names, 11411 last names
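The three feature groups listed above (endings, character n-grams, name dictionaries) can be sketched as one extraction function. This is only an illustration under stated assumptions, not the authors' implementation: the ending list, the toy dictionary values, and the names `FEMALE_ENDINGS`, `char_ngrams` and `name_features` are all hypothetical.

```python
from collections import Counter

# Hypothetical ending list; the slide names only "a" and "ya" explicitly.
FEMALE_ENDINGS = ("a", "ya")

# Toy dictionaries mapping a name to P(c = male | name); values are made up.
P_MALE_FIRST = {"alexander": 0.99, "alexandra": 0.01, "oleg": 0.99}
P_MALE_LAST = {"arbuzov": 0.98, "arbuzova": 0.02, "sidorenko": 0.5}


def char_ngrams(text, n=3):
    """Count character n-grams, using '^' and '$' as boundary marks."""
    padded = "^" + text.lower() + "$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def name_features(first, last):
    """Build a sparse feature dict from endings, 3-grams and dictionaries."""
    first, last = first.lower(), last.lower()
    features = {}
    for part, value in (("first", first), ("last", last)):
        for ending in FEMALE_ENDINGS:
            if value.endswith(ending):
                features["end_%s_%s" % (part, ending)] = 1.0
        for gram, count in char_ngrams(value).items():
            features["3g_%s_%s" % (part, gram)] = float(count)
    # Names missing from a dictionary fall back to 0.5 (no evidence).
    features["p_male_first"] = P_MALE_FIRST.get(first, 0.5)
    features["p_male_last"] = P_MALE_LAST.get(last, 0.5)
    return features
```

A feature dictionary of this kind could then be fed to any linear classifier, such as the L2-regularized logistic regression mentioned in the talk.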

Results

Model                  Accuracy       Precision      Recall         F-measure
rule-based baseline    0.638          0.995          0.633          0.774
endings                0.850 ± 0.002  0.921 ± 0.003  0.784 ± 0.004  0.847 ± 0.002
3-grams                0.944 ± 0.003  0.948 ± 0.003  0.946 ± 0.003  0.947 ± 0.003
dicts                  0.956 ± 0.002  0.992 ± 0.001  0.925 ± 0.003  0.957 ± 0.002
endings+3-grams        0.946 ± 0.003  0.950 ± 0.002  0.947 ± 0.004  0.949 ± 0.003
3-grams+dicts          0.956 ± 0.003  0.960 ± 0.003  0.957 ± 0.004  0.959 ± 0.003
endings+3-grams+dicts  0.957 ± 0.003  0.961 ± 0.003  0.959 ± 0.004  0.960 ± 0.002

Table: Results of the experiments on the training set of 10000 names. Here "endings" stands for 4 Russian female endings, "3-grams" for the 1000 most frequent 3-grams, and "dicts" for the name/surname dictionaries. The table presents precision, recall and F-measure of the female class.
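As a quick consistency check on the table, the F-measure column can be recomputed from precision and recall. The sketch assumes that "F-measure" denotes the balanced F1 score (the harmonic mean of precision and recall); the slides do not state this explicitly.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the female class, taken from three table rows.
rows = {
    "rule-based baseline": (0.995, 0.633),
    "endings": (0.921, 0.784),
    "dicts": (0.992, 0.925),
}

for model, (p, r) in rows.items():
    print(model, round(f_measure(p, r), 3))
```

Under that assumption the recomputed values agree with the reported F-measure column to three decimal places.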

Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10000 names.

4 User Language Detection

Problem

Motivation

  • Goal: to detect Russian-speaking users
  • The Cyrillic alphabet is also used by Ukrainian, Belorussian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

  • Which method is the best for the Russian language?
  • How to adapt it to the FB profile?

Contributions

  • Comparison of Russian-enabled language detection modules
  • A technique for identification of Russian-speaking users

Method

  • input: a FB user profile
  • output: whether the user is a Russian speaker (or the set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка ать ото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те ели ере кот льн ник нти о с
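The slide does not say how the trigram list is used for scoring, so the following is only one possible sketch: it measures the share of a text's character trigrams that belong to a set of common Russian trigrams. The set contains only the entries that survive in the list above as unambiguous three-letter strings; the function names are hypothetical.

```python
# Three-letter entries copied from the list above. Entries that contain
# spaces were damaged in extraction and are left out of this sketch.
COMMON_RU_TRIGRAMS = {
    "про", "ова", "тел", "тор", "сти", "стр", "ват", "иче", "сто",
    "вер", "ист", "ост", "тра", "кот", "льн", "ник", "нти",
}


def trigrams(text):
    """Return all character trigrams of the lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def russian_trigram_share(text):
    """Share of the text's trigrams found in the common Russian set."""
    grams = trigrams(text)
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in COMMON_RU_TRIGRAMS)
    return hits / len(grams)
```

A production detector would compare full trigram frequency profiles across languages; this share is only a rough signal for a single language.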

Existing modules for language identification

  • langid.py: https://github.com/saffsd/langid.py (advanced n-gram selection)
  • chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector
  • guess-language: https://code.google.com/p/guess-language
  • Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language (20$/1M characters)
  • Yandex Translate API: http://api.yandex.ru/translate (free of charge, 1M characters/day, as of September 2013)
  • Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

  • Profile text: posts + comments + user names, minus Latin symbols
  • Profile text length: 3367 ± 17540
  • Russian-speakers: P(ru) > 0.95
  • Core Russian-speakers:
      • P(ru) > 0.95
      • Cyrillic symbols ≥ 20
      • locale is ru_RU
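The two user categories above can be written as a small rule function. This sketch assumes that the three "core" conditions are combined with a logical AND, that the Cyrillic threshold of 20 is a character count, and that P(ru) is supplied by an external language detector. None of these details are spelled out on the slide, and the function names are hypothetical.

```python
def count_cyrillic(text):
    """Count characters in the basic Cyrillic block (U+0400 to U+04FF)."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")


def classify_profile(p_ru, profile_text, locale,
                     p_threshold=0.95, min_cyrillic=20):
    """Return 'core', 'russian' or 'other' for one profile.

    p_ru is the detector's probability that the profile text is Russian.
    """
    if p_ru <= p_threshold:
        return "other"
    is_core = (count_cyrillic(profile_text) >= min_cyrillic
               and locale == "ru_RU")
    return "core" if is_core else "russian"
```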

Facebook Dataset: Results

  • 9906524 public FB profiles (≥ 50 Cyrillic characters)
  • 8687915 (88%) Russian-speaking users
  • 3190813 (32%) core Russian-speaking public Facebook users
  • 5365691 (54%) profiles with no profile text (≤ 200 characters)

5 User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

  • input: some SN data representing a user
  • output: list of user interests

Motivation

  • Advertisement targeting, user segmentation, etc.
  • Recommendations of content and friends
  • Customization of user experience

Data: FB and VK groups

Text corpus

  • 41 million VK groups
  • 11 million FB publics
  • 15 million FB groups

Data format

  • Title and/or description
  • List of members
  • Number of comments, likes, posts by a member

Data: VK groups

253 interests detected by our system

Method

1 Create a text index of groups
2 Create a keyword list for each of the 253 interests
3 KW classifier
  • Retrieve the top k groups matching a set of interest keywords
  • Rank by TF-IDF
  • Associate the group's interests with its users
  • A group may have multiple interests
4 ML classifier
  • Use the top k groups as training data
  • BOW features
  • Keyword features
  • Linear models: L2 LR, Linear SVM, NB
  • Classify all groups
  • A group may have up to three top interests
  • Associate the group's interests with its users
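The keyword (KW) classifier step, retrieving the top k groups for an interest and ranking them by TF-IDF, can be sketched as follows. The tokenizer, the exact TF-IDF weighting, and the names `tokenize` and `top_k_groups` are assumptions made for illustration; the talk does not specify the indexing engine or the weighting variant.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization; real data would need more care."""
    return text.lower().split()


def top_k_groups(groups, keywords, k=2):
    """Rank groups by the summed TF-IDF weight of the interest keywords.

    groups maps a group id to its title/description text.
    Returns up to k (group_id, score) pairs with a positive score.
    """
    docs = {gid: Counter(tokenize(text)) for gid, text in groups.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())

    scores = {}
    for gid, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for word in keywords:
            if counts[word]:
                tf = counts[word] / total
                idf = math.log(n_docs / doc_freq[word])
                score += tf * idf
        if score > 0:
            scores[gid] = score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

The groups returned for each interest could then serve as training data for the ML classifier of step 4.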

Association of group's interests with its users

  • Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

      • l: the number of post likes
      • c_s: the number of short comments
      • c_l: the number of long comments
      • r: the number of reposts

  • The association score of a user and an interest depends on the engagement in a group and on the number of groups:

$$\mathrm{all} \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

      • e_vk, e_fb: engagement in the VK/FB interest
      • g_vk, g_fb: number of groups a user has in VK/FB
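The two scoring formulas above translate directly into code. The weight values and the defaults for α and β below are placeholders chosen for the example; the talk does not report the actual values.

```python
def engagement(likes, short_comments, long_comments, reposts,
               w_like=1.0, w_scomm=2.0, w_lcomm=3.0, w_repost=4.0):
    """e = w_like * l + w_scomm * c_s + w_lcomm * c_l + w_repost * r."""
    return (w_like * likes + w_scomm * short_comments
            + w_lcomm * long_comments + w_repost * reposts)


def association(e_fb, g_fb, e_vk, g_vk, alpha=0.5, beta=0.5):
    """all = alpha * e_fb * g_fb + beta * e_vk * g_vk."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk
```

For instance, with these placeholder weights a user with 10 likes, 2 short comments, 1 long comment and 1 repost has engagement 10 + 4 + 3 + 4 = 21.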

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • input: a user profile of one social network
  • output: profile of the same person in another social network
  • immediate applications in marketing, search, security, etc.

Contribution

  • user identity resolution approach
  • precision of 0.98 and recall of 0.54
  • the method is computationally efficient and easily parallelizable

Dataset

                                  VK          Facebook
Number of users in our dataset    89561085    2903144
Number of users in Russia (1)     100000000   13000000
User overlap                      29          88

training set: 92488 matched FB-VK profiles

(1) According to comScore and http://vk.com/about

Profile matching algorithm

1 Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.
2 Candidate ranking. The candidates are ranked according to the similarity of their friends.
3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein Automata for fuzzy match between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms:
      • "Alexander", "Sasha", "Sanya", "Sanek", etc.
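The name similarity rule above (same first letter, edit distance at most 2, plus a synonym dictionary) can be sketched as follows. Note that this sketch uses a plain dynamic-programming edit distance with pairwise comparison rather than the Levenshtein automata named in the talk, so it illustrates the rule but would not scale to all FB user names. The function names and the synonym dictionary format are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def names_similar(a, b, max_distance=2):
    """Same first letter and edit distance of at most max_distance."""
    a, b = a.lower(), b.lower()
    if not a or not b or a[0] != b[0]:
        return False
    return levenshtein(a, b) <= max_distance


def generate_candidates(vk_name, fb_names, synonyms=None):
    """Return FB names similar to vk_name or to one of its synonyms.

    synonyms maps a lowercased name to a list of alternative forms.
    """
    variants = {vk_name} | set((synonyms or {}).get(vk_name.lower(), ()))
    return [fb for fb in fb_names
            if any(names_similar(variant, fb) for variant in variants)]
```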

Candidate ranking

  • The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles.
  • Two friends are considered to be similar if:
      • the first two letters of their last names match;
      • the similarities sim_s between their first and last names are greater than the thresholds α and β:

$$\mathrm{sim}_s(s_i, s_j) = 1 - \frac{\mathrm{lev}(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

  • The contribution of each friend to the similarity sim_p of two profiles p_vk and p_fb is the inverse of the expected name frequency:

$$\mathrm{sim}_p(p_{vk}, p_{fb}) = \sum_{j:\ \mathrm{sim}_s(s^f_i, s^f_j) > \alpha \,\wedge\, \mathrm{sim}_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$$

Here s^f_i and s^s_i are the first and second names of a VK profile, respectively, while s^f_j and s^s_j refer to a FB profile.
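A possible reading of the ranking formulas above is sketched below. Several points are assumptions, not statements from the talk: |s^f_j| and |s^s_j| inside the min term are interpreted as frequencies of the first and second names (passed in as dictionaries), N is treated as a caller-supplied constant, each FB friend is counted at most once, and the default thresholds 0.8 and 0.6 are the values the talk reports for α and β.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def sim_s(a, b):
    """sim_s = 1 - lev(a, b) / max(|a|, |b|), on lowercased strings."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n,
          alpha=0.8, beta=0.6):
    """Sum the contributions of FB friends that match some VK friend.

    Friends are (first_name, last_name) pairs. first_freq and last_freq
    map a lowercased name to its frequency (assumed interpretation).
    """
    score = 0.0
    for fb_first, fb_last in fb_friends:
        for vk_first, vk_last in vk_friends:
            if vk_last[:2].lower() != fb_last[:2].lower():
                continue  # first two letters of last names must match
            if (sim_s(vk_first, fb_first) > alpha
                    and sim_s(vk_last, fb_last) > beta):
                freq = (first_freq.get(fb_first.lower(), 1)
                        * last_freq.get(fb_last.lower(), 1))
                score += min(1.0, n / freq)
                break  # count each FB friend at most once
    return score
```

With this weighting, a shared friend with a common name adds little to the score, while a shared friend with a rare name adds up to 1.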

Best candidate selection

  • FB candidates are ranked according to their similarity sim_p to the input profile p_vk.
  • The best candidate p_fb must pass two thresholds to match:
      • its score must be higher than the score threshold γ:

$$\mathrm{sim}_p(p_{vk}, p_{fb}) > \gamma$$

      • it is either the only candidate, or the ratio between its score and the score of the next best candidate p'_fb must be higher than the ratio threshold δ:

$$\frac{\mathrm{sim}_p(p_{vk}, p_{fb})}{\mathrm{sim}_p(p_{vk}, p'_{fb})} > \delta$$
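The two-threshold selection rule above can be sketched as a short function. The defaults γ = 3 and δ = 5 are the threshold values reported in the results of the talk; the function name, the input format, and the handling of a zero runner-up score are assumptions.

```python
def select_best(candidates, gamma=3.0, delta=5.0):
    """Return the id of the accepted FB candidate, or None.

    candidates is a list of (fb_id, sim_p_score) pairs.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score <= gamma:
        return None  # fails the score threshold
    if len(ranked) == 1:
        return best_id  # the only candidate needs no ratio test
    runner_up = ranked[1][1]
    # A zero runner-up score makes the ratio unbounded, so it passes.
    if runner_up == 0 or best_score / runner_up > delta:
        return best_id
    return None
```

Requiring a clear margin over the runner-up trades recall for precision, which is consistent with the high-precision operating point reported in the talk.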

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α        0.8
Second name threshold β       0.6
Profile score threshold γ     3
Profile ratio threshold δ     5
Number of matched profiles    644334 (22%)
Expected precision            0.98
Expected recall               0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection

  • "Tell me who your friends are and I will tell you who you are"
  • Most frequent age/region of friends
  • Reject users with high variation of age/region among friends
  • Up to 85-90% precision

User Income Detection

  • Transfer learning: the target variable is not present in SNs
  • Training a model on a set of users with known income
  • Applying the model to the social network profiles
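The age detection heuristic described above (take the most frequent age among friends, reject users whose friends vary too much) can be sketched as follows. Using the population standard deviation as the measure of variation and the cutoff `max_std=5.0` are assumptions made for this example; the talk does not specify either.

```python
from collections import Counter
from statistics import pstdev


def infer_age(friend_ages, max_std=5.0):
    """Return the most frequent age among friends, or None if rejected."""
    if not friend_ages:
        return None
    if pstdev(friend_ages) > max_std:
        return None  # friends are too heterogeneous to trust the vote
    return Counter(friend_ages).most_common(1)[0][0]
```

The same majority-vote pattern would apply to regions, with a categorical dispersion measure in place of the standard deviation.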

Thank you! Questions?



  • Social Network Data
  • Social Network Analysis
  • User Gender Detection
  • User Language Detection
  • User Interests Detection
  • VK-FB User Matching
  • Other SNA Tasks
Page 20: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Model                    Accuracy         Precision        Recall           F-measure
rule-based baseline      0.638            0.995            0.633            0.774
endings                  0.850 ± 0.002    0.921 ± 0.003    0.784 ± 0.004    0.847 ± 0.002
3-grams                  0.944 ± 0.003    0.948 ± 0.003    0.946 ± 0.003    0.947 ± 0.003
dicts                    0.956 ± 0.002    0.992 ± 0.001    0.925 ± 0.003    0.957 ± 0.002
endings+3-grams          0.946 ± 0.003    0.950 ± 0.002    0.947 ± 0.004    0.949 ± 0.003
3-grams+dicts            0.956 ± 0.003    0.960 ± 0.003    0.957 ± 0.004    0.959 ± 0.003
endings+3-grams+dicts    0.957 ± 0.003    0.961 ± 0.003    0.959 ± 0.004    0.960 ± 0.002

Table: Results of the experiments on the training set of 10,000 names. Here endings – 4 Russian female endings; 3-grams – 1000 most frequent 3-grams; dicts – name/surname dictionary. This table presents precision, recall and F-measure of the female class.
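The F-measure column can be cross-checked against the precision and recall columns. The sketch below assumes the balanced F-measure (harmonic mean of precision and recall); the slide does not state the variant explicitly, and the function name `f_measure` is only illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) for three rows of the table above.
rows = {
    "rule-based baseline": (0.995, 0.633, 0.774),
    "endings": (0.921, 0.784, 0.847),
    "dicts": (0.992, 0.925, 0.957),
}

for model, (p, r, reported) in rows.items():
    print(model, round(f_measure(p, r), 3), reported)
```

Under this assumption the recomputed values agree with the reported ones to three decimals, which also shows why the rule-based baseline scores low despite its very high precision: its recall is only 0.633.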

Results

Figure: Learning curves of single and combined models. Accuracy was estimated on a separate sample of 10,000 names.

User Language Detection

Problem

Motivation

Goal: to detect Russian-speaking users
The Cyrillic alphabet is also used by Ukrainian, Belarusian, Bulgarian, Serbian, Macedonian, Kazakh, etc.

Research Questions

Which method is the best for the Russian language?
How to adapt it to the FB profile?

Contributions

Comparison of Russian-enabled language detection modules
A technique for identification of Russian-speaking users

Method

Input: a FB user profile
Output: whether the user is a Russian speaker (or the set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с
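A list of common character trigrams like the one above can be obtained by counting overlapping three-character windows in a text sample. The following is a minimal sketch of that idea, not the implementation of any of the modules discussed in this talk; the function names and the overlap score are illustrative assumptions.

```python
from collections import Counter


def char_trigrams(text: str) -> list[str]:
    """Overlapping character trigrams; word boundaries are kept as spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def top_trigrams(texts: list[str], k: int) -> list[str]:
    """The k most frequent trigrams over a sample of texts."""
    counts = Counter()
    for text in texts:
        counts.update(char_trigrams(text))
    return [gram for gram, _ in counts.most_common(k)]


def profile_overlap(text: str, profile: list[str]) -> float:
    """Share of the trigrams of `text` that belong to a language profile."""
    grams = char_trigrams(text)
    if not grams:
        return 0.0
    known = set(profile)
    return sum(gram in known for gram in grams) / len(grams)
```

A text written in another Cyrillic-based language shares the alphabet but tends to share fewer of the frequent trigrams, which is why trigram statistics are more informative than a simple alphabet check.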

Existing modules for language identification

langid.py
  https://github.com/saffsd/langid.py
  Advanced n-gram selection

chromium compact language detector (cld)
  https://code.google.com/p/chromium-compact-language-detector

guess-language
  https://code.google.com/p/guess-language

Google Translate API
  https://developers.google.com/translate/v2/using_rest#detect-language
  $20 per 1M characters

Yandex Translate API
  http://api.yandex.ru/translate
  Free of charge, 1M characters per day (as of September 2013)

Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

Profile text: posts + comments + user names − Latin symbols
Profile text length: 3367 ± 17540
Russian-speakers: P(ru) > 0.95
Core Russian-speakers:
  P(ru) > 0.95
  Cyrillic symbols >= 20
  locale is ru_RU
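The two rules on this slide can be sketched as simple predicates. This is a minimal illustration under stated assumptions: P(ru) is taken as already computed by some language detection module, the `Profile` class and function names are hypothetical, and the threshold of 20 Cyrillic symbols is treated as a character count because the slide does not state the unit.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    text: str      # concatenated posts, comments and user names
    locale: str    # e.g. "ru_RU"
    p_ru: float    # P(ru) from a language detector (assumed to be given)


def count_cyrillic(text: str) -> int:
    """Number of characters in the basic Cyrillic Unicode block."""
    return sum("\u0400" <= ch <= "\u04FF" for ch in text)


def is_russian_speaker(profile: Profile, threshold: float = 0.95) -> bool:
    return profile.p_ru > threshold


def is_core_russian_speaker(profile: Profile,
                            threshold: float = 0.95,
                            min_cyrillic: int = 20) -> bool:
    # Assumption: min_cyrillic is an absolute count, not a percentage.
    return (profile.p_ru > threshold
            and count_cyrillic(profile.text) >= min_cyrillic
            and profile.locale == "ru_RU")
```

The core definition is strictly narrower than the first rule, so every core Russian-speaker is also a Russian-speaker.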

Facebook Dataset: Results

9,906,524 public FB profiles (>= 50 Cyrillic characters)
8,687,915 (88%) Russian-speaking users
3,190,813 (32%) core Russian-speaking public Facebook users
5,365,691 (54%) profiles with no profile text (<= 200 characters)

User Interests Detection

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Input: some SN data representing a user
Output: a list of user interests

Motivation

Advertisement targeting, user segmentation, etc.
Recommendations of content and friends
Customization of user experience

Data: FB and VK groups

Text corpus

41 million VK groups
11 million FB publics
15 million FB groups

Data format

Title and/or description
List of members
Number of comments, likes, posts by a member

Data: VK groups

253 interests detected by our system

Method

1 Create a text index of groups
2 Create a keyword list for each of 253 interests
3 KW classifier
    Retrieve top k groups by a set of interest keywords
    Rank by TF-IDF
    Associate group's interests with its users
    A group may have multiple interests
4 ML classifier
    Use top k groups as training data
    BOW features
    Keyword features
    Linear models: L2 LR, Linear SVM, NB
    Classify all groups
    A group may have up to three top interests
    Associate group's interests with its users
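The keyword (KW) classifier step can be illustrated with a small TF-IDF ranking over group titles and descriptions. This is only a sketch: the whitespace tokenizer, the exact TF-IDF variant (raw term frequency times log inverse document frequency) and all function names are assumptions, since the slides do not specify them.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def build_index(groups: dict[str, str]):
    """Term frequencies per group and document frequencies per term."""
    tf = {gid: Counter(tokenize(text)) for gid, text in groups.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return tf, df


def top_k_groups(groups: dict[str, str], keywords: list[str], k: int) -> list[str]:
    """Rank groups by the summed TF-IDF of the (lowercase) interest keywords."""
    tf, df = build_index(groups)
    n = len(groups)
    scores = {}
    for gid, counts in tf.items():
        score = sum(counts[kw] * math.log(n / df[kw])
                    for kw in keywords if counts[kw] > 0)
        if score > 0:
            scores[gid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


groups = {
    "g1": "football fans club",
    "g2": "cooking recipes",
    "g3": "football football news",
}
print(top_k_groups(groups, ["football"], 2))
```

The groups returned for an interest can then serve as training data for the ML classifier of step 4.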

Association of group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$

l – the number of post likes
c_s – the number of short comments
c_l – the number of long comments
r – the number of reposts

Association score of a user and an interest depends on engagement in a group and on the number of groups:

$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$

e_vk, e_fb – engagement in the VK/FB interest
g_vk, g_fb – number of groups a user has in VK/FB
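The two formulas translate directly into code. In the sketch below the weight values, α and β are placeholders chosen for illustration only; the slides do not report the values actually used, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    likes: int            # l
    short_comments: int   # c_s
    long_comments: int    # c_l
    reposts: int          # r


@dataclass
class Weights:
    # Placeholder weights (assumption): costlier actions weigh more.
    like: float = 1.0
    short_comment: float = 2.0
    long_comment: float = 3.0
    repost: float = 4.0


def engagement(a: Activity, w: Weights) -> float:
    """e = w_like*l + w_scomm*c_s + w_lcomm*c_l + w_repost*r"""
    return (w.like * a.likes
            + w.short_comment * a.short_comments
            + w.long_comment * a.long_comments
            + w.repost * a.reposts)


def association(e_fb: float, g_fb: int, e_vk: float, g_vk: int,
                alpha: float = 1.0, beta: float = 1.0) -> float:
    """all = alpha*e_fb*g_fb + beta*e_vk*g_vk"""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk


e_fb = engagement(Activity(likes=3, short_comments=2, long_comments=1, reposts=1), Weights())
print(e_fb, association(e_fb, g_fb=2, e_vk=5.0, g_vk=3))
```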

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

Input: a user profile in one social network
Output: the profile of the same person in another social network
Immediate applications in marketing, search, security, etc.

Contribution

User identity resolution approach
Precision of 0.98 and recall of 0.54
The method is computationally efficient and easily parallelizable

Dataset

                                  VK             Facebook
Number of users in our dataset    89,561,085     2,903,144
Number of users in Russia ¹       100,000,000    13,000,000
User overlap                      2.9%           88%

Training set: 92,488 matched FB-VK profiles

¹ According to comScore and http://vk.com/about

Profile matching algorithm

1 Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.

2 Candidate ranking. The candidates are ranked according to the similarity of their friends.

3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

Retrieve FB users with names similar to the input VK profile
Two names are similar if the first letters are the same and the edit distance between the names is ≤ 2
Levenshtein automata for fuzzy matching between a VK user name and all FB user names
Automatically extracted dictionary of name synonyms
  "Alexander", "Sasha", "Sanya", "Sanek", etc.
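The name similarity rule can be sketched as follows. Note that this example uses a plain dynamic-programming edit distance between two names, not the Levenshtein automata mentioned above, which are what makes matching against all FB names fast. The tiny `SYNONYMS` dictionary and the function names are hypothetical and stand in for the automatically extracted dictionary.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


# Hypothetical synonym dictionary: canonical name -> diminutive forms.
SYNONYMS = {"alexander": {"sasha", "sanya", "sanek"}}


def canonical(name: str) -> str:
    name = name.lower()
    for canon, variants in SYNONYMS.items():
        if name == canon or name in variants:
            return canon
    return name


def names_similar(a: str, b: str, max_dist: int = 2) -> bool:
    """Same first letter and edit distance <= max_dist, after synonym lookup."""
    a, b = canonical(a), canonical(b)
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_dist


print(names_similar("Sasha", "Alexander"), names_similar("Aleksander", "Alexander"))
```

Mapping both names to a canonical form first lets diminutives such as "Sasha" match "Alexander" even though their edit distance is large.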

Candidate ranking

The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles.
Two friends are considered to be similar if:

  The first two letters of their last names match
  The similarities sim_s between their first and last names are greater than the thresholds α and β, respectively:

  $sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$

The contribution of each friend to the similarity sim_p of two profiles p_vk and p_fb is the inverse of the expected name frequency:

$sim_p(p_{vk}, p_{fb}) = \sum_{j:\ sim_s(s_i^f, s_j^f) > \alpha \,\wedge\, sim_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)$

Here s_i^f and s_i^s are the first and second names of a VK profile, respectively, while s_j^f and s_j^s refer to a FB profile.
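A possible reading of the ranking score is sketched below. Several points are assumptions rather than statements from the slides: N is taken to be the total number of profiles, |s_j^f| and |s_j^s| inside the sum are taken to be the frequencies of the first and second name (supplied here as hypothetical lookup tables), names are lowercase, and each FB friend contributes at most once.

```python
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a: str, b: str) -> float:
    """sim_s = 1 - lev(a, b) / max(|a|, |b|), where |.| is string length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def friends_similar(vk, fb, alpha, beta) -> bool:
    (vk_first, vk_last), (fb_first, fb_last) = vk, fb
    return (vk_last[:2] == fb_last[:2]
            and sim_s(vk_first, fb_first) > alpha
            and sim_s(vk_last, fb_last) > beta)


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n,
          alpha=0.8, beta=0.6) -> float:
    """Sum over matched FB friends of min(1, N / (freq(first) * freq(last)))."""
    score = 0.0
    for fb in fb_friends:
        if any(friends_similar(vk, fb, alpha, beta) for vk in vk_friends):
            score += min(1.0, n / (first_freq[fb[0]] * last_freq[fb[1]]))
    return score


vk_friends = [("ivan", "petrov"), ("anna", "sidorova")]
fb_friends = [("ivan", "petrov"), ("oleg", "smirnov")]
first_freq = {"ivan": 100, "oleg": 50}       # hypothetical name counts
last_freq = {"petrov": 200, "smirnov": 10}
print(sim_p(vk_friends, fb_friends, first_freq, last_freq, n=1000))
```

Under this reading, a shared friend with a rare full name adds close to 1 to the score, while a shared friend with a very common name adds little, since such a match is likely to occur by chance.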

Best candidate selection

FB candidates are ranked according to their similarity sim_p to the input profile p_vk.

The best candidate p_fb should pass two thresholds to match:

  its score should be higher than the score threshold γ:

  $sim_p(p_{vk}, p_{fb}) > \gamma$

  it is either the only candidate, or the score ratio between it and the next best candidate p'_fb should be higher than the ratio threshold δ:

  $\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$
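The two-threshold selection rule can be sketched as a short function. The default values γ = 3 and δ = 5 are the thresholds reported on the results slide of this talk; the function name, the input format and the handling of a runner-up with zero score are assumptions made for this example.

```python
from typing import Optional


def select_best(scored: list[tuple[str, float]],
                gamma: float = 3.0,
                delta: float = 5.0) -> Optional[str]:
    """scored: (candidate id, sim_p score) pairs for one VK profile.

    Returns the id of the accepted FB candidate, or None if no candidate
    passes both the score threshold and the ratio threshold.
    """
    if not scored:
        return None
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score <= gamma:
        return None
    if len(ranked) == 1:
        return best_id
    runner_up = ranked[1][1]
    # Assumption: a runner-up with zero score never blocks the best candidate.
    if runner_up == 0 or best_score / runner_up > delta:
        return best_id
    return None


print(select_best([("fb_1", 10.0), ("fb_2", 1.0)]))   # clear winner
print(select_best([("fb_1", 10.0), ("fb_2", 4.0)]))   # ambiguous, rejected
```

Rejecting ambiguous cases trades recall for precision, which fits the reported operating point of high precision and moderate recall.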

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α          0.8
Second name threshold β         0.6
Profile score threshold γ       3
Profile ratio threshold δ       5
Number of matched profiles      644,334 (22%)
Expected precision              0.98
Expected recall                 0.54

Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection
  Tell me who your friends are and I will tell you who you are
  Most frequent age/region of friends
  Reject users with high variation of age/region among friends
  Up to 85–90% precision

User Income Detection
  Transfer learning: the target variable is not present in SNs
  Training a model on a set of users with known income
  Applying the model to the social network profiles
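The friend-based age detection idea above can be sketched in a few lines: take the most frequent age among friends and abstain when the ages vary too much. The rejection criterion (population standard deviation), the thresholds `max_std` and `min_friends`, and the function name are assumptions; the slides do not specify how the variation is measured.

```python
from collections import Counter
from statistics import pstdev
from typing import Optional


def predict_age(friend_ages: list[int],
                max_std: float = 5.0,
                min_friends: int = 3) -> Optional[int]:
    """Most frequent age among friends, or None if the evidence is weak."""
    if len(friend_ages) < min_friends:
        return None                      # too few friends to decide
    if pstdev(friend_ages) > max_std:
        return None                      # friends' ages vary too much
    return Counter(friend_ages).most_common(1)[0][0]


print(predict_age([25, 25, 26, 24, 25]))   # homogeneous friends
print(predict_age([18, 45, 70, 25]))       # heterogeneous friends, rejected
```

The same scheme applies to region detection with a categorical measure of variation, for example the share of friends in the most frequent region.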

Thank you! Questions?

Page 22: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Motivation

Goal to detect Russian-speaking usersCyrillic alphabet is used also by Ukrainian BelorussianBulgarian Serbian Macedonian Kazakh etc

Research Questions

Which method is the best for Russian languageHow to adopt it to the FB profile

Contributions

Comparison of Russian-enabled language detection modulesA technique for identification of Russian-speaking users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

  • Input: a FB user profile
  • Output: whether the user is a Russian speaker (or the set of languages the user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка ать ото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те ели ере кот льн ник нти о с
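
A minimal sketch of how a list of frequent trigrams could be used to score a text, assuming a plain overlap measure rather than the method of any module discussed in this talk. The function names, the padding of words with spaces, and the tiny reference set are hypothetical and only illustrate the idea.

```python
from collections import Counter


def char_trigrams(text: str) -> Counter:
    """Count character trigrams of a lowercased text, padding each word with spaces."""
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def trigram_overlap(text: str, reference: set) -> float:
    """Share of the text's trigram occurrences that belong to the reference set."""
    counts = char_trigrams(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for tri, n in counts.items() if tri in reference)
    return hits / total


# Hypothetical reference profile: a handful of frequent Russian trigrams (illustrative only).
RUSSIAN_TRIGRAMS = {" на", " пр", "то ", " не", " по", "про", "ать", "ост", "сти", "ова"}
```

A real detector would compare the text against profiles of several languages and pick the best one; this sketch only shows the feature extraction and a single-language score.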

Existing modules for language identification

  • langid.py: https://github.com/saffsd/langid.py
      • Advanced n-gram selection
  • chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector
  • guess-language: https://code.google.com/p/guess-language
  • Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language
      • $20 per 1M characters
  • Yandex Translate API: http://api.yandex.ru/translate
      • Free of charge: 1M characters per day (as of September 2013)
  • Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

  • Profile text: posts + comments + user names – Latin symbols
  • Profile text length: 3367 ± 17540
  • Russian-speakers: P(ru) > 0.95
  • Core Russian-speakers:
      • P(ru) > 0.95
      • Cyrillic symbols >= 20
      • locale is ru_RU
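
The criteria above can be sketched as a small decision function. This is an illustration under stated assumptions: `p_ru` is taken to be the probability of Russian returned by some language detector, the three "core" conditions are combined with a logical AND, and "Cyrillic symbols >= 20" is read as a character count. All names are hypothetical.

```python
import re

# Basic Cyrillic Unicode block.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def count_cyrillic(text: str) -> int:
    """Number of characters from the basic Cyrillic block."""
    return len(CYRILLIC.findall(text))


def classify_profile(p_ru: float, profile_text: str, locale: str) -> str:
    """Return 'core', 'russian' or 'other' for one profile.

    Assumption: the thresholds follow the slide (P(ru) > 0.95, at least
    20 Cyrillic symbols, locale ru_RU); the AND combination is a guess.
    """
    if p_ru <= 0.95:
        return "other"
    if count_cyrillic(profile_text) >= 20 and locale == "ru_RU":
        return "core"
    return "russian"
```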

Facebook Dataset: Results

  • 9,906,524 public FB profiles (>= 50 Cyrillic characters)
  • 8,687,915 (88%) Russian-speaking users
  • 3,190,813 (32%) core Russian-speaking public Facebook users
  • 5,365,691 (54%) profiles with no profile text (<= 200 characters)

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

  • Input: some SN data representing a user
  • Output: list of user interests

Motivation

  • Advertisement targeting, user segmentation, etc.
  • Recommendations of content and friends
  • Customization of user experience

Data: FB and VK groups

Text corpus

  • 41 million VK groups
  • 11 million FB publics
  • 15 million FB groups

Data format

  • Title and/or description
  • List of members
  • Number of comments, likes, posts by a member

Data: VK groups

253 interests detected by our system

Method

1 Create a text index of groups
2 Create a keyword list for each of the 253 interests
3 KW classifier
      • Retrieve the top k groups matching the set of interest keywords
      • Rank by TF-IDF
      • Associate the group’s interests with its users
      • A group may have multiple interests
4 ML classifier
      • Use the top k groups as training data
      • BOW features
      • Keyword features
      • Linear models: L2 LR, Linear SVM, NB
      • Classify all groups
      • A group may have up to three top interests
      • Associate the group’s interests with its users
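
Steps 1 to 3 can be sketched roughly as follows. This is a toy illustration, not the system described in the talk: an in-memory dictionary stands in for the text index, whitespace splitting stands in for real tokenization, and the TF-IDF variant (relative term frequency times log(N/df)) is one common choice assumed here. The function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def top_k_groups(groups, keywords, k):
    """Rank groups by the summed TF-IDF weight of the interest keywords.

    groups: dict mapping a group id to its title/description text.
    keywords: list of keywords describing one interest.
    Returns the ids of at most k groups with a non-zero score, best first.
    """
    docs = {gid: Counter(tokenize(text)) for gid, text in groups.items()}
    n_docs = len(docs)
    # Document frequency: in how many groups each token occurs.
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    scores = {}
    for gid, counts in docs.items():
        length = sum(counts.values())
        score = 0.0
        for kw in keywords:
            if counts[kw]:
                tf = counts[kw] / length
                idf = math.log(n_docs / df[kw])
                score += tf * idf
        if score > 0:
            scores[gid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The groups returned for each interest could then serve as training data for the ML classifier of step 4.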

Association of a group’s interests with its users

  • Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

    $e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$

      • $l$ – the number of post likes
      • $c_s$ – the number of short comments
      • $c_l$ – the number of long comments
      • $r$ – the number of reposts

  • The association score of a user and an interest depends on the engagement in a group and on the number of groups:

    $all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$

      • $e_{vk}, e_{fb}$ – engagement in a VK/FB interest
      • $g_{vk}, g_{fb}$ – number of groups a user has in VK/FB
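
The two formulas above translate directly into code. In the sketch below the weight values and the defaults for alpha and beta are hypothetical placeholders, since the talk does not state them; only the shape of the computation follows the slide.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """Activity counts of one user in the groups of one interest category."""
    likes: int = 0
    short_comments: int = 0
    long_comments: int = 0
    reposts: int = 0


# Hypothetical weights; the actual values are not given in the talk.
WEIGHTS = {"like": 1.0, "short_comment": 2.0, "long_comment": 3.0, "repost": 2.0}


def engagement(a: Activity, w=WEIGHTS) -> float:
    """e = w_like * l + w_scomm * c_s + w_lcomm * c_l + w_repost * r"""
    return (w["like"] * a.likes
            + w["short_comment"] * a.short_comments
            + w["long_comment"] * a.long_comments
            + w["repost"] * a.reposts)


def association(e_fb, g_fb, e_vk, g_vk, alpha=1.0, beta=1.0):
    """Association score: alpha * e_fb * g_fb + beta * e_vk * g_vk."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk
```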

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • Input: a user profile in one social network
  • Output: the profile of the same person in another social network
  • Immediate applications in marketing, search, security, etc.

Contribution

  • A user identity resolution approach
  • Precision of 0.98 and recall of 0.54
  • The method is computationally efficient and easily parallelizable

Dataset

                                   VK             Facebook
Number of users in our dataset     89,561,085     2,903,144
Number of users in Russia [1]      100,000,000    13,000,000
User overlap                       29%            88%

Training set: 92,488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Profile matching algorithm

1 Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.

2 Candidate ranking. The candidates are ranked according to the similarity of their friends.

3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if their first letters are the same and the edit distance between the names is <= 2
  • Levenshtein automata for fuzzy matching between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms
      • “Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
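
A minimal sketch of the name similarity test used for candidate generation. Note the swap: instead of Levenshtein automata it scans all names with a plain dynamic-programming edit distance, which is far slower on millions of names but shows the same matching rule. The synonym dictionary contents and the way synonyms are applied (mapping a diminutive to a canonical form before comparison) are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


# Hypothetical synonym dictionary: diminutive -> canonical name.
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(name: str) -> str:
    """Lowercase a name and replace a known diminutive by its canonical form."""
    name = name.lower()
    return NAME_SYNONYMS.get(name, name)


def names_similar(a: str, b: str, max_dist: int = 2) -> bool:
    """Same first letter and edit distance <= max_dist, after synonym mapping."""
    a, b = canonical(a), canonical(b)
    return bool(a) and bool(b) and a[0] == b[0] and levenshtein(a, b) <= max_dist


def generate_candidates(vk_name: str, fb_names):
    """Brute-force scan over FB names; an automaton or index would replace this at scale."""
    return [n for n in fb_names if names_similar(vk_name, n)]
```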

Candidate ranking

  • The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles
  • Two friends are considered to be similar if:
      • the first two letters of their last names match
      • the similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$:

        $sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$

  • The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected name frequency:

    $sim_p(p_{vk}, p_{fb}) = \sum_{j:\, sim_s(s^f_i, s^f_j) > \alpha \,\wedge\, sim_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$

    Here $s^f_i$ and $s^s_i$ are the first and second names of a VK profile, respectively, while $s^f_j$ and $s^s_j$ refer to a FB profile.
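
The ranking score can be sketched as below. Several points are assumptions rather than facts from the talk: in the $sim_p$ formula, $|s^f_j|$ and $|s^s_j|$ are read as the number of profiles bearing that first or second name and $N$ as the total number of profiles, so that rare names contribute more; the frequency tables are hypothetical inputs; and the default thresholds 0.8 and 0.6 are the values reported on the results slide.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a: str, b: str) -> float:
    """Normalized string similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def friends_similar(vk, fb, alpha=0.8, beta=0.6) -> bool:
    """vk and fb are (first name, second name) pairs of two friends."""
    (f_i, s_i), (f_j, s_j) = vk, fb
    return (s_i[:2] == s_j[:2]
            and sim_s(f_i, f_j) > alpha
            and sim_s(s_i, s_j) > beta)


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n_profiles,
          alpha=0.8, beta=0.6) -> float:
    """Sum over similar friend pairs of min(1, N / (freq(first) * freq(second))).

    first_freq / last_freq map a FB name to the number of profiles bearing it
    (an assumed reading of the |s| terms in the formula).
    """
    score = 0.0
    for vk in vk_friends:
        for fb in fb_friends:
            if friends_similar(vk, fb, alpha, beta):
                expected = first_freq[fb[0]] * last_freq[fb[1]]
                score += min(1.0, n_profiles / expected)
    return score
```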

Best candidate selection

  • FB candidates are ranked according to their similarity $sim_p$ to the input profile $p_{vk}$
  • The best candidate $p_{fb}$ should pass two thresholds to match:
      • its score should be higher than the score threshold $\gamma$:

        $sim_p(p_{vk}, p_{fb}) > \gamma$

      • it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

        $\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$
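
The two-threshold rule can be sketched as a small selection function. The function name and the dictionary input are hypothetical; the defaults for gamma and delta are the values reported on the results slide, and the handling of a runner-up with zero score (treated as passing the ratio test) is an assumption.

```python
def select_best(candidates, gamma=3.0, delta=5.0):
    """Pick the FB candidate matching a VK profile, or None if no confident match.

    candidates: dict mapping a FB profile id to its sim_p score.
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    best_id, best = ranked[0]
    # Threshold 1: the absolute score must exceed gamma.
    if best <= gamma:
        return None
    # Threshold 2: either the only candidate, or clearly ahead of the runner-up.
    if len(ranked) == 1:
        return best_id
    runner_up = ranked[1][1]
    if runner_up <= 0 or best / runner_up > delta:
        return best_id
    return None
```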

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α         0.8
Second name threshold β        0.6
Profile score threshold γ      3
Profile ratio threshold δ      5
Number of matched profiles     644,334 (22%)
Expected precision             0.98
Expected recall                0.54

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

  • User Age & Region Detection
      • “Tell me who your friends are and I will tell you who you are”
      • Most frequent age/region of friends
      • Reject users with high variation of age/region among friends
      • Up to 85-90% precision
  • User Income Detection
      • Transfer learning: the target variable is not present in SNs
      • Train a model on a set of users with known income
      • Apply the model to the social network profiles
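
The friend-based age/region heuristic can be sketched as a majority vote with a rejection rule. This is an assumption-laden illustration: "high variation" is approximated by the share of the most frequent value falling below a hypothetical `min_share` threshold, which the talk does not specify.

```python
from collections import Counter


def infer_from_friends(friend_values, min_share=0.5):
    """Return the most frequent attribute value among friends, or None.

    friend_values: region (or age) of each friend; None marks a missing value.
    The user is rejected (None) when friends disagree too much, i.e. when the
    most frequent value covers less than min_share of the known values.
    """
    values = [v for v in friend_values if v is not None]
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    if count / len(values) < min_share:
        return None
    return value
```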

Thank you! Questions?

Page 24: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

input a FB user profileoutput is Russian-speaker (or a set of languages user speaks)

Common Russian character trigrams

на пр то не ли по но в на ть не и ко ом про то их ка атьото за ие ова тел тор де ой сти от ах ми стр бе во ра ая ват ей ет же иче ия ов сто об вер го ив и п и с ии ист о в ост тра те елиере кот льн ник нти о с

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Existing modules for language identification

langidpyhttpsgithubcomsaffsdlangidpyAdvanced n-gram selection

chromium compact language detector (cld)httpscodegooglecompchromium-compact-language-detector

guess-languagehttpscodegooglecompguess-language

Google Translate APIhttpsdevelopersgooglecomtranslatev2using_restdetect-language20$1M characters

Yandex Translate APIhttpapiyandexrutranslateFree of charge 1M of characters day (by September 2013)

Many more eg language-detection for JavaAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

DBpedia Dataset

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Accuracy of Different Language Detection Modules

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Method

Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers

P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Results

9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

input some SN data representing a useroutput list of user interests

Motivation

Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data FB and VK groups

Text corpus

41 million of VK groups11 million of FB publics15 million of FB groups

Data format

Title andor descriptionList of membersNumber of comments likes posts by a member

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data VK groups

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

253 interests detected by our system

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier

Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests

4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users

Association of group's interests with its users

Engagement of a person into an interest category is proportional to the activity of the person in groups of this category:

$$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$$

  • $l$: the number of post likes
  • $c_s$: the number of short comments
  • $c_l$: the number of long comments
  • $r$: the number of reposts

Association score of a user and an interest depends on engagement in a group and on the number of groups:

$$a_{all} \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$$

  • $e_{vk}$, $e_{fb}$: engagement into the interest on VK/FB
  • $g_{vk}$, $g_{fb}$: number of groups a user has in VK/FB
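
The two scores above can be computed as in the following sketch. The weight values and the defaults for α and β are placeholders chosen for illustration (the slides do not give them), and the `Activity` container is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """Activity counters of one user in the groups of one interest category."""
    likes: int = 0
    short_comments: int = 0
    long_comments: int = 0
    reposts: int = 0


# Placeholder weights: assumed values, not taken from the talk.
WEIGHTS = {"like": 1.0, "short_comment": 2.0, "long_comment": 3.0, "repost": 4.0}


def engagement(a, w=WEIGHTS):
    """e = w_like*l + w_scomm*c_s + w_lcomm*c_l + w_repost*r"""
    return (
        w["like"] * a.likes
        + w["short_comment"] * a.short_comments
        + w["long_comment"] * a.long_comments
        + w["repost"] * a.reposts
    )


def association(e_fb, g_fb, e_vk, g_vk, alpha=0.5, beta=0.5):
    """a_all = alpha*e_fb*g_fb + beta*e_vk*g_vk (alpha and beta are assumed defaults)."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk


fb = Activity(likes=3, short_comments=1)
vk = Activity(likes=2, long_comments=1, reposts=1)
score = association(engagement(fb), 2, engagement(vk), 4)
```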

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • Input: a user profile of one social network
  • Output: profile of the same person in another social network
  • Immediate applications in marketing, search, security, etc.

Contribution

  • User identity resolution approach
  • Precision of 0.98 and recall of 0.54
  • The method is computationally efficient and easily parallelizable

Dataset

                                  VK             Facebook
Number of users in our dataset    89,561,085     2,903,144
Number of users in Russia [1]     100,000,000    13,000,000
User overlap                      29%            88%

Training set: 92,488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Profile matching algorithm

1 Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.

2 Candidate ranking. The candidates are ranked according to the similarity of their friends.

3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if the first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein Automata for fuzzy match between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms: "Alexander", "Sasha", "Sanya", "Sanek", etc.
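
The similarity rule above can be illustrated with the sketch below. It deliberately swaps the Levenshtein automaton for a plain dynamic-programming edit distance and a linear scan, which is far slower on millions of names but shows the same matching rule. The synonym dictionary, the helper names and the sample profiles are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance of at most max_dist."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_dist


# Hypothetical excerpt of a name-synonym dictionary (diminutive -> canonical form).
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(name):
    """Map a diminutive to its canonical form if the dictionary knows it."""
    name = name.lower()
    return NAME_SYNONYMS.get(name, name)


def generate_candidates(vk_profile, fb_profiles):
    """Return FB (first, last) pairs whose first and last names both match the VK profile."""
    vk_first, vk_last = vk_profile
    return [
        (fb_first, fb_last)
        for fb_first, fb_last in fb_profiles
        if names_similar(canonical(vk_first), canonical(fb_first))
        and names_similar(vk_last, fb_last)
    ]


fb_profiles = [
    ("Alexander", "Panchenko"),
    ("Alexandr", "Panchenko"),
    ("Alexander", "Ivanov"),
    ("Maria", "Panchenko"),
]
candidates = generate_candidates(("Sasha", "Panchenko"), fb_profiles)
```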

Candidate ranking

The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles.

Two friends are considered to be similar if:

  • The first two letters of their last names match
  • The similarities $\mathrm{sim}_s$ between their first/last names are greater than the thresholds $\alpha$, $\beta$:

$$\mathrm{sim}_s(s_i, s_j) = 1 - \frac{\mathrm{lev}(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

The contribution of each friend to the similarity $\mathrm{sim}_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected name frequency:

$$\mathrm{sim}_p(p_{vk}, p_{fb}) = \sum_{j:\ \mathrm{sim}_s(s^f_i, s^f_j) > \alpha \,\wedge\, \mathrm{sim}_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$$

Here $s^f_i$ and $s^s_i$ are the first and second names of a VK profile, respectively, while $s^f_j$ and $s^s_j$ refer to a FB profile.
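
A possible reading of the ranking score is sketched below. Several points are assumptions rather than facts from the slides: $|s^f_j|$ and $|s^s_j|$ are interpreted as frequency counts of the first and last name, $N$ as a normalization constant, and the sum is taken over all pairs of VK and FB friends that satisfy both conditions. The frequency tables and friend lists are toy data.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n=100.0, alpha=0.8, beta=0.6):
    """Profile similarity: sum of rarity-weighted contributions of matching friends.

    Assumption: first_freq / last_freq hold name frequency counts, and n plays
    the role of the constant N in the formula.
    """
    score = 0.0
    for vk_first, vk_last in vk_friends:
        for fb_first, fb_last in fb_friends:
            # The first two letters of the last names must match.
            if vk_last[:2] != fb_last[:2]:
                continue
            if sim_s(vk_first, fb_first) > alpha and sim_s(vk_last, fb_last) > beta:
                freq = first_freq.get(fb_first, 1) * last_freq.get(fb_last, 1)
                # Rare names contribute more, capped at 1.
                score += min(1.0, n / freq)
    return score


# Toy data (lowercased names, hypothetical frequency counts).
vk_friends = [("ivan", "petrov"), ("anna", "sidorova")]
fb_friends = [("ivan", "petrov"), ("anna", "sidorov"), ("oleg", "smirnov")]
first_freq = {"ivan": 50, "anna": 40, "oleg": 10}
last_freq = {"petrov": 20, "sidorov": 5, "smirnov": 30}

profile_score = sim_p(vk_friends, fb_friends, first_freq, last_freq)
```

With these toy counts the common pair "ivan petrov" contributes 0.1 and the rarer "anna sidorov" contributes 0.5, which illustrates why matching friends with rare names are stronger evidence.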

Best candidate selection

FB candidates are ranked according to their similarity $\mathrm{sim}_p$ to an input profile $p_{vk}$.

The best candidate $p_{fb}$ should pass two thresholds to match:

  • its score should be higher than the score threshold $\gamma$:

$$\mathrm{sim}_p(p_{vk}, p_{fb}) > \gamma$$

  • it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$$\frac{\mathrm{sim}_p(p_{vk}, p_{fb})}{\mathrm{sim}_p(p_{vk}, p'_{fb})} > \delta$$
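
The two-threshold rule can be written as a short function. The sketch below uses γ = 3 and δ = 5, the values reported on the results slide, as defaults; the function name, the input format and the handling of a zero runner-up score (treated as an unbounded ratio) are assumptions.

```python
def select_best(scored_candidates, gamma=3.0, delta=5.0):
    """Pick the best FB candidate or return None.

    scored_candidates: list of (candidate_id, sim_p score) pairs.
    """
    ranked = sorted(scored_candidates, key=lambda c: c[1], reverse=True)
    if not ranked:
        return None
    best_id, best_score = ranked[0]
    # Threshold 1: the absolute score must exceed gamma.
    if best_score <= gamma:
        return None
    # Threshold 2: either the only candidate, or clearly ahead of the runner-up.
    if len(ranked) == 1:
        return best_id
    runner_up = ranked[1][1]
    if runner_up == 0 or best_score / runner_up > delta:
        return best_id
    return None
```

Raising either threshold trades recall for precision, since ambiguous or weakly supported matches are rejected instead of guessed.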

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α         0.8
Second name threshold β        0.6
Profile score threshold γ      3
Profile ratio threshold δ      5
Number of matched profiles     644,334 (22%)
Expected precision             0.98
Expected recall                0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection

  • "Tell me who your friends are and I will say who you are"
  • Most frequent age/region of friends
  • Reject users with high variation of age/region among friends
  • Up to 85-90% precision

User Income Detection

  • Transfer learning: the target variable is not present in SNs
  • Training a model on a set of users with known income
  • Applying the model to the social network profiles
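
The friend-based age detection idea can be sketched as follows. The rejection rule is implemented here with a population standard deviation, and both thresholds (`max_std`, `min_friends`) are hypothetical values chosen for the example, not parameters from the talk.

```python
from collections import Counter
from statistics import pstdev


def infer_age(friend_ages, max_std=5.0, min_friends=3):
    """Return the most frequent age among friends, or None if the user is rejected."""
    if len(friend_ages) < min_friends:
        return None
    # Reject users whose friends vary too much in age.
    if pstdev(friend_ages) > max_std:
        return None
    return Counter(friend_ages).most_common(1)[0][0]
```

The same pattern applies to regions, with the standard deviation replaced by a measure suited to categorical values, for example the share of friends in the most frequent region.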

Thank you! Questions?


Page 25: Text Analysis of Social Networks: Working with FB and VK Data

Existing modules for language identification

  • langid.py: https://github.com/saffsd/langid.py (advanced n-gram selection)
  • chromium compact language detector (cld): https://code.google.com/p/chromium-compact-language-detector
  • guess-language: https://code.google.com/p/guess-language
  • Google Translate API: https://developers.google.com/translate/v2/using_rest#detect-language ($20 per 1M characters)
  • Yandex Translate API: http://api.yandex.ru/translate (free of charge, 1M characters per day, as of September 2013)
  • Many more, e.g. language-detection for Java

DBpedia Dataset

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

  • Profile text: posts + comments + user names, minus Latin symbols
  • Profile text length: 3367 +- 17540
  • Russian-speakers: P(ru) > 0.95
  • Core Russian-speakers: P(ru) > 0.95, Cyrillic symbols >= 20, locale is ru_RU
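
The rules above can be expressed as a small decision function. In this sketch `p_ru` is assumed to come from some language identifier (for example one of the modules listed earlier), and the ">= 20" condition is interpreted as a count of Cyrillic characters, which is an assumption because the slide does not state the unit. The function and label names are hypothetical.

```python
def count_cyrillic(text):
    """Count characters in the basic Cyrillic Unicode block (U+0400..U+04FF)."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04FF")


def classify_profile(p_ru, text, locale, p_threshold=0.95, min_cyrillic=20):
    """Label a profile as 'core_russian_speaker', 'russian_speaker' or 'other'.

    p_ru: probability of Russian returned by a language identifier (assumed input).
    min_cyrillic: assumed to be a character count threshold.
    """
    if p_ru <= p_threshold:
        return "other"
    if count_cyrillic(text) >= min_cyrillic and locale == "ru_RU":
        return "core_russian_speaker"
    return "russian_speaker"
```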

Facebook Dataset: Results

  • 9,906,524 public FB profiles (>= 50 Cyrillic characters)
  • 8,687,915 (88%) Russian-speaking users
  • 3,190,813 (32%) core Russian-speaking public Facebook users
  • 5,365,691 (54%) profiles with no profile text (<= 200 characters)

Page 26: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

DBpedia Dataset

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Accuracy of Different Language Detection Modules

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Method

Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers

P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Results

9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

input some SN data representing a useroutput list of user interests

Motivation

Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data FB and VK groups

Text corpus

41 million of VK groups11 million of FB publics15 million of FB groups

Data format

Title andor descriptionList of membersNumber of comments likes posts by a member

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data VK groups

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

253 interests detected by our system

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier

Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests

4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Association of grouprsquos interests with its users

Engagement of a person into an interest category isproportional to the activity of the person in groups of this category

e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r

l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts

Association score of a user and an interest depends onengagement in a group and on the number of groups

all asymp α middot efb middot gfb + β middot evk middot gvk

evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results per category the best and the worst

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Intersection of the top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Interests co-occurrences

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Segei Objedkov

Motivation

input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc

Contribution

user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Dataset

VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88

training set 92488 matched FB-VK profiles

1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Profile matching algorithm

1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names

2 Candidate ranking The candidates are ranked according tosimilarity of their friends

3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate generation

Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms

ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate ranking

The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if

First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β

sims(si sj) = 1minus lev(si sj)max(|si | |sj |)

Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency

simp(pvk pfb) =sum

j sims(sfi s

fj )gtαandsims(ss

i ssj )gtβ

min(1N

|s fj | middot |ss

j |)

Here s fi and ss

i are first and second names of a VK profilecorrespondingly while s f

j and ssj refer to a FB profile

Page 27: Text Analysis of Social Networks: Working with FB and VK Data

Accuracy of Different Language Detection Modules

Facebook Dataset: Method

  • Profile text: posts + comments + user names – Latin symbols
  • Profile text length: 3367 ± 17540
  • Russian-speakers: P(ru) > 0.95
  • Core Russian-speakers:
      • P(ru) > 0.95
      • Cyrillic symbols ≥ 20
      • locale is ru_RU
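The filtering rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the probability `p_ru` is assumed to come from some external language detector, and the three "core" conditions are assumed to be combined with a logical AND.

```python
import re

# Characters of the basic Unicode Cyrillic block.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def count_cyrillic(text: str) -> int:
    """Count Cyrillic characters in a profile text."""
    return len(CYRILLIC.findall(text))


def classify_profile(profile_text: str, p_ru: float, locale: str,
                     p_threshold: float = 0.95, min_cyrillic: int = 20) -> str:
    """Label a profile using the thresholds listed on the slide.

    p_ru is the probability of Russian reported by a language detector
    (an input here; the detector itself is out of scope for this sketch).
    """
    if p_ru <= p_threshold:
        return "other"
    # Assumption: all three "core" conditions must hold at the same time.
    if count_cyrillic(profile_text) >= min_cyrillic and locale == "ru_RU":
        return "core_russian_speaker"
    return "russian_speaker"


if __name__ == "__main__":
    text = "Привет, как дела? Сегодня отличная погода"
    print(classify_profile(text, p_ru=0.99, locale="ru_RU"))
```

Keeping the detector output as an input makes the rule easy to re-run with different thresholds.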

Facebook Dataset: Results

  • 9,906,524 public FB profiles (≥ 50 Cyrillic characters)
  • 8,687,915 (88%) Russian-speaking users
  • 3,190,813 (32%) core Russian-speaking public Facebook users
  • 5,365,691 (54%) profiles with no profile text (≤ 200 characters)

5 User Interests Detection


Problem

Joint work with Dmitry Babaev and Sergei Objedkov

  • Input: some SN data representing a user
  • Output: a list of user interests

Motivation

  • Advertisement: targeting, user segmentation, etc.
  • Recommendations of content and friends
  • Customization of user experience

Data: FB and VK groups

Text corpus

  • 41 million VK groups
  • 11 million FB publics
  • 15 million FB groups

Data format

  • Title and/or description
  • List of members
  • Number of comments, likes, posts by a member

Data: VK groups

253 interests detected by our system

Method

1 Create a text index of groups
2 Create a keyword list for each of the 253 interests
3 KW classifier
  • Retrieve the top k groups matched by a set of interest keywords
  • Rank by TF-IDF
  • Associate the group's interests with its users
  • A group may have multiple interests
4 ML classifier
  • Use the top k groups as training data
  • BOW features
  • Keyword features
  • Linear models: L2 LR, linear SVM, NB
  • Classify all groups
  • A group may have up to three top interests
  • Associate the group's interests with its users
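The keyword (KW) classifier step can be illustrated with a small pure-Python sketch: groups are scored by the summed TF-IDF weight of an interest's keywords and the top k are returned. The tokenizer, the exact TF-IDF variant, and the function names are assumptions made for illustration; the slides do not specify them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (a simplification; no stemming)."""
    return re.findall(r"\w+", text.lower())


def top_k_groups(groups, keywords, k=2):
    """Rank groups for one interest by the summed TF-IDF of its keywords.

    groups:   dict mapping a group id to its title/description text
    keywords: lowercase keywords describing a single interest
    Returns up to k (group_id, score) pairs, best first.
    """
    docs = {gid: Counter(tokenize(text)) for gid, text in groups.items()}
    n_docs = len(docs)
    # Document frequency: in how many groups each token occurs.
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())

    scores = {}
    for gid, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for kw in keywords:
            if counts[kw]:
                tf = counts[kw] / total
                idf = math.log(n_docs / df[kw])
                score += tf * idf
        if score > 0:
            scores[gid] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    groups = {
        "g1": "Football fans club football news",
        "g2": "Cooking recipes and kitchen tips",
        "g3": "Football and cooking for students",
    }
    print(top_k_groups(groups, ["football", "soccer"]))
```

In the pipeline described on the slide, the groups retrieved this way would then serve as training data for the ML classifier.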

Association of a group's interests with its users

Engagement of a person in an interest category is proportional to the activity of the person in groups of this category:

$e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r$

  • $l$ – the number of post likes
  • $c_s$ – the number of short comments
  • $c_l$ – the number of long comments
  • $r$ – the number of reposts

The association score of a user and an interest depends on the engagement in a group and on the number of groups:

$all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk}$

  • $e_{vk}$, $e_{fb}$ – engagement in the VK/FB interest
  • $g_{vk}$, $g_{fb}$ – number of groups a user has in VK/FB
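Both scores are plain weighted sums, so they are easy to sketch in code. The weight values and the network coefficients below are placeholders chosen for illustration only; the slides do not give the values actually used.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """Activity counts of one user in the groups of one interest category."""
    likes: int
    short_comments: int
    long_comments: int
    reposts: int


def engagement(a: Activity, w_like: float = 1.0, w_scomm: float = 2.0,
               w_lcomm: float = 3.0, w_repost: float = 4.0) -> float:
    """e = w_like*l + w_scomm*c_s + w_lcomm*c_l + w_repost*r.

    The default weights are hypothetical placeholders.
    """
    return (w_like * a.likes + w_scomm * a.short_comments
            + w_lcomm * a.long_comments + w_repost * a.reposts)


def association(e_fb: float, g_fb: int, e_vk: float, g_vk: int,
                alpha: float = 0.5, beta: float = 0.5) -> float:
    """all = alpha*e_fb*g_fb + beta*e_vk*g_vk (alpha and beta are placeholders)."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk


if __name__ == "__main__":
    e_fb = engagement(Activity(likes=10, short_comments=2, long_comments=1, reposts=1))
    print(e_fb, association(e_fb, g_fb=2, e_vk=10.0, g_vk=3))
```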

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching


Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • Input: a user profile of one social network
  • Output: the profile of the same person in another social network
  • Immediate applications in marketing, search, security, etc.

Contribution

  • A user identity resolution approach
  • Precision of 0.98 and recall of 0.54
  • The method is computationally efficient and easily parallelizable

Dataset

                                    VK             Facebook
Number of users in our dataset      89,561,085     2,903,144
Number of users in Russia [1]       100,000,000    13,000,000
User overlap                        29             88

Training set: 92,488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Profile matching algorithm

1 Candidate generation: for each VK profile we retrieve a set of FB profiles with similar first and second names.

2 Candidate ranking: the candidates are ranked according to the similarity of their friends.

3 Selection of the best candidate: the goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein automata for fuzzy matching between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms:
      • “Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
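The similarity rule for names can be sketched as follows. Note that this sketch scans all FB names with a plain dynamic-programming edit distance, which is a simple stand-in for the Levenshtein automata mentioned on the slide, not an implementation of them. The `synonyms` dictionary and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def names_similar(a: str, b: str, max_dist: int = 2) -> bool:
    """Same first letter and edit distance of at most max_dist."""
    a, b = a.lower(), b.lower()
    if not a or not b or a[0] != b[0]:
        return False
    return levenshtein(a, b) <= max_dist


def generate_candidates(vk_name, fb_names, synonyms=None):
    """Return FB names similar to vk_name or to one of its synonyms.

    synonyms maps a name to alternative forms; in the talk this
    dictionary is extracted automatically, here it is simply an input.
    """
    variants = {vk_name} | set((synonyms or {}).get(vk_name, ()))
    return [fb for fb in fb_names
            if any(names_similar(v, fb) for v in variants)]


if __name__ == "__main__":
    syn = {"Alexander": ["Sasha", "Sanya", "Sanek"]}
    print(generate_candidates("Alexander", ["Aleksander", "Sasha", "Olga"], syn))
```

A linear scan like this costs one distance computation per FB name, which is why an automaton or index is preferable on datasets of millions of profiles.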

Candidate ranking

  • The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles.
  • Two friends are considered to be similar if:
      • the first two letters of their last names match;
      • the similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$:

$$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

  • The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected frequency of the friend's name:

$$sim_p(p_{vk}, p_{fb}) = \sum_{j:\ sim_s(s^f_i, s^f_j) > \alpha \,\wedge\, sim_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$$

Here $s^f_i$ and $s^s_i$ are the first and second names of a VK profile, respectively, while $s^f_j$ and $s^s_j$ refer to a FB profile.
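A rough sketch of the two similarity functions is given below. It rests on an interpretation that the slide does not state explicitly: in the profile score, $|s^f_j|$ and $|s^s_j|$ are read as the numbers of profiles carrying that first and last name, and $N$ as the total number of profiles, so that $|s^f_j| \cdot |s^s_j| / N$ is the expected frequency of the full name. The frequency tables and function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a: str, b: str) -> float:
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n_profiles,
          alpha=0.8, beta=0.6):
    """Profile similarity based on friends with similar names.

    vk_friends, fb_friends: lists of (first_name, last_name), lowercase.
    first_freq, last_freq:  name -> number of profiles with that name
                            (assumed reading of |s^f_j| and |s^s_j|).
    n_profiles:             total number of profiles (assumed reading of N).
    """
    score = 0.0
    for vk_first, vk_last in vk_friends:
        for fb_first, fb_last in fb_friends:
            # The first two letters of the last names must match.
            if vk_last[:2] != fb_last[:2]:
                continue
            if (sim_s(vk_first, fb_first) > alpha
                    and sim_s(vk_last, fb_last) > beta):
                expected = first_freq[fb_first] * last_freq[fb_last]
                # Rare names contribute more, capped at 1 per friend pair.
                score += min(1.0, n_profiles / expected)
    return score


if __name__ == "__main__":
    vk = [("ivan", "petrov"), ("anna", "sidorova")]
    fb = [("ivan", "petrov"), ("olga", "smirnova")]
    print(sim_p(vk, fb, {"ivan": 100, "olga": 50},
                {"petrov": 40, "smirnova": 30}, n_profiles=1000))
```

The cap at 1 keeps a single friend with a very rare name from dominating the score.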

Best candidate selection

FB candidates are ranked according to their similarity $sim_p$ to an input profile $p_{vk}$.

The best candidate $p_{fb}$ should pass two thresholds to match:

  • its score should be higher than the score threshold $\gamma$:

$$sim_p(p_{vk}, p_{fb}) > \gamma$$

  • it should either be the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$$
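The two-threshold rule translates directly into code. The sketch below assumes the candidates have already been scored; the function name is hypothetical, and treating a zero score of the runner-up as an infinite ratio is an assumption added to avoid division by zero.

```python
def select_best(scored, gamma=3.0, delta=5.0):
    """Pick the matching FB profile or return None.

    scored: list of (fb_profile_id, sim_p score) for one VK profile.
    gamma:  minimum score of the best candidate.
    delta:  minimum ratio between the best and the second-best score.
    """
    if not scored:
        return None
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score <= gamma:
        return None
    if len(ranked) == 1:
        return best_id  # the only candidate needs no ratio test
    second_score = ranked[1][1]
    # Assumption: a zero runner-up score counts as an infinite ratio.
    if second_score == 0 or best_score / second_score > delta:
        return best_id
    return None


if __name__ == "__main__":
    print(select_best([("fb_1", 12.0), ("fb_2", 2.0)]))
    print(select_best([("fb_1", 12.0), ("fb_2", 4.0)]))
```

Raising either threshold trades recall for precision, which is the trade-off the precision-recall plot on the next slide explores.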

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α          0.8
Second name threshold β         0.6
Profile score threshold γ       3
Profile ratio threshold δ       5
Number of matched profiles      644,334 (22%)
Expected precision              0.98
Expected recall                 0.54

7 Other SNA Tasks


Much more fun stuff can be done with the FB/VK data

User Age & Region Detection

  • “Tell me who your friends are and I will tell you who you are”
  • Most frequent age/region of friends
  • Reject users with high variation of age/region among friends
  • Up to 85–90% precision

User Income Detection

  • Transfer learning: the target variable is not present in SNs
  • Train a model on a set of users with known income
  • Apply the model to the social network profiles
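The friend-based age and region heuristic can be sketched as a majority vote with a rejection rule. The rejection criteria used here (a standard deviation cap for ages, a minimum share for the top region) and their threshold values are illustrative assumptions; the slide only says that users with high variation among friends are rejected.

```python
from collections import Counter
from statistics import pstdev


def infer_age(friend_ages, max_std=5.0):
    """Most frequent age among friends, or None if ages vary too much.

    max_std is a hypothetical rejection threshold.
    """
    if not friend_ages:
        return None
    if pstdev(friend_ages) > max_std:
        return None
    return Counter(friend_ages).most_common(1)[0][0]


def infer_region(friend_regions, min_share=0.5):
    """Most frequent region among friends, or None if no region dominates.

    min_share is a hypothetical rejection threshold.
    """
    if not friend_regions:
        return None
    region, count = Counter(friend_regions).most_common(1)[0]
    if count / len(friend_regions) < min_share:
        return None
    return region


if __name__ == "__main__":
    print(infer_age([25, 25, 26, 24, 25]))
    print(infer_region(["Kharkiv", "Kharkiv", "Kyiv"]))
```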

Thank you! Questions?


  • Social Network Data
  • Social Network Analysis
  • User Gender Detection
  • User Language Detection
  • User Interests Detection
  • VK-FB User Matching
  • Other SNA Tasks
Page 28: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Method

Profile text posts + comments + user names ndash LatinsymbolsProfile text length 3367 +- 17540Russian-speakers P(ru) gt 095Core Russian-speakers

P(ru) gt 095 Cyrillic symbols gt= 20locale is ru_RU

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Results

9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

input some SN data representing a useroutput list of user interests

Motivation

Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data FB and VK groups

Text corpus

41 million of VK groups11 million of FB publics15 million of FB groups

Data format

Title andor descriptionList of membersNumber of comments likes posts by a member

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data VK groups

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

253 interests detected by our system

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier

Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests

4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Association of grouprsquos interests with its users

Engagement of a person into an interest category isproportional to the activity of the person in groups of this category

e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r

l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts

Association score of a user and an interest depends onengagement in a group and on the number of groups

all asymp α middot efb middot gfb + β middot evk middot gvk

evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results per category the best and the worst

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Intersection of the top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Interests co-occurrences

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Segei Objedkov

Motivation

input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc

Contribution

user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Dataset

VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88

training set 92488 matched FB-VK profiles

1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Profile matching algorithm

1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names

2 Candidate ranking The candidates are ranked according tosimilarity of their friends

3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate generation

Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms

ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate ranking

The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if

First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β

sims(si sj) = 1minus lev(si sj)max(|si | |sj |)

Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency

simp(pvk pfb) =sum

j sims(sfi s

fj )gtαandsims(ss

i ssj )gtβ

min(1N

|s fj | middot |ss

j |)

Here s fi and ss

i are first and second names of a VK profilecorrespondingly while s f

j and ssj refer to a FB profile

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Best candidate selection

FB candidates are ranked according to similarity simp to aninput profile pvk

The best candidate pfb should pass two thresholds to match

its score should be higher than the score threshold γ

simp(pvk pfb) gt γ

either the only candidate or score ratio between it and the nextbest candidate pprime

fb should be higher than the ratio threshold δ

simp(pvk pfb)

simp(pvk pprimefb)

gt δ

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results matching VK and FB profiles

First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Much more fun stuff can be done with the FBVK data

User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision

User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Thank you Questions

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Al Zamal F Liu W and Ruths D (2012) Homophily and latentattribute inference Inferring latent attributes of twitter usersfrom neighbors In ICWSM

Burger J D Henderson J Kim G and Zarrella G (2011)Discriminating gender on twitter In Proceedings of theConference on Empirical Methods in Natural LanguageProcessing pages 1301ndash1309 Association for ComputationalLinguistics

Ciot M Sonderegger M and Ruths D (2013) Gender inferenceof twitter users in non-english contexts In Proceedings of the2013 Conference on Empirical Methods in Natural LanguageProcessing Seattle Wash pages 18ndash21

Goswami S Sarkar S and Rustagi M (2009) Stylometricanalysis of bloggersrsquo age and gender In Third International AAAIConference on Weblogs and Social Media

Koppel M Argamon S and Shimoni A R (2002)Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Automatically categorizing written texts by author genderLiterary and Linguistic Computing 17(4)401ndash412

Liu W Zamal F A and Ruths D (2012) Using social media toinfer gender composition of commuter populations Proceedingsof the When the City Meets the Citizen Worksop

Mukherjee A and Liu B (2010) Improving gender classificationof blog authors In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing pages207ndash217 Association for Computational Linguistics

Peersman C Daelemans W and Van Vaerenbergh L (2011)Predicting age and gender in online social networks InProceedings of the 3rd international workshop on Search andmining user-generated contents pages 37ndash44 ACM

Rangel F and Rosso P (2013) Use of language and authorprofiling Identification of gender and age Natural LanguageProcessing and Cognitive Science page 177

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Rao D Yarowsky D Shreevats A and Gupta M (2010)Classifying latent user attributes in twitter In Proceedings of the2nd international workshop on Search and mining user-generatedcontents pages 37ndash44 ACM

Alexander Panchenko Text Analysis of Social Networks

  • Social Network Data
  • Social Network Analysis
  • User Gender Detection
  • User Language Detection
  • User Interests Detection
  • VK-FB User Matching
  • Other SNA Tasks
Page 29: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Facebook Dataset Results

9906524 public FB profiles (gt= 50 cyr characters)8687915 (88) Russian-speaking users3190813 (32) core Russian-speaking public Facebook users5365691 (54) of profiles with no profile text (lt= 200characters)

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

input some SN data representing a useroutput list of user interests

Motivation

Advertisement targeting user segmentation etcRecommendations of content and friendsCustomization of user experience

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data FB and VK groups

Text corpus

41 million of VK groups11 million of FB publics15 million of FB groups

Data format

Title andor descriptionList of membersNumber of comments likes posts by a member

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data VK groups

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

253 interests detected by our system

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier

Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests

4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Association of grouprsquos interests with its users

Engagement of a person into an interest category isproportional to the activity of the person in groups of this category

e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r

l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts

Association score of a user and an interest depends onengagement in a group and on the number of groups

all asymp α middot efb middot gfb + β middot evk middot gvk

evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results per category the best and the worst

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Intersection of the top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Interests co-occurrences

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • Input: a user profile in one social network
  • Output: the profile of the same person in another social network
  • Immediate applications in marketing, search, security, etc.

Contribution

  • A user identity resolution approach
  • Precision of 0.98 and recall of 0.54
  • The method is computationally efficient and easily parallelizable

Dataset

                                VK          Facebook
Number of users in our dataset  89561085    2903144
Number of users in Russia [1]   100000000   13000000
User overlap                    29          88

Training set: 92488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Profile matching algorithm

1 Candidate generation. For each VK profile, we retrieve a set of FB profiles with similar first and second names.

2 Candidate ranking. The candidates are ranked according to the similarity of their friends.

3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein automata for fuzzy matching between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms: “Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
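The similarity rule above can be sketched as follows. Note that this example uses a plain dynamic-programming edit distance and a brute-force scan over FB profiles instead of Levenshtein automata, so it illustrates the matching criterion rather than the efficient retrieval. The synonym dictionary is a hand-written stand-in for the automatically extracted one, and applying it to both first and second names is a simplification.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Hypothetical synonym dictionary mapping name variants to a canonical form.
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(name):
    name = name.lower()
    return NAME_SYNONYMS.get(name, name)


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance <= max_dist, after synonym lookup."""
    a, b = canonical(a), canonical(b)
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_dist


def generate_candidates(vk_profile, fb_profiles):
    """Return FB (first, second) name pairs similar to the VK name pair."""
    first, second = vk_profile
    return [p for p in fb_profiles
            if names_similar(first, p[0]) and names_similar(second, p[1])]


fb = [("Alexander", "Panchenko"), ("Alexey", "Panchenko"), ("Alexander", "Petrov")]
candidates = generate_candidates(("Sasha", "Panchenko"), fb)
```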


Candidate ranking

The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles. Two friends are considered to be similar if:

  • the first two letters of their last names match;
  • the similarities $sim_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$:

$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$

The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected name frequency:

$sim_p(p_{vk}, p_{fb}) = \sum_{j:\, sim_s(s_i^f, s_j^f) > \alpha \,\wedge\, sim_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)$

Here $s_i^f$ and $s_i^s$ are the first and second names of a VK profile, respectively, while $s_j^f$ and $s_j^s$ refer to a FB profile.
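A small sketch of the two similarity functions follows. It rests on interpretations that the slide does not spell out: inside the `min` term, the name terms in the denominator are read as frequency counts of the first and second name, and N as a normalizing constant, so that rare names contribute more. Each FB friend is counted at most once. The frequency tables and all parameter names are hypothetical.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a, b):
    """sim_s = 1 - lev(a, b) / max(|a|, |b|), computed on lowercased strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def friends_similar(vk_friend, fb_friend, alpha=0.8, beta=0.6):
    """First two letters of last names match and both similarities pass."""
    (f_i, s_i), (f_j, s_j) = vk_friend, fb_friend
    if s_i[:2].lower() != s_j[:2].lower():
        return False
    return sim_s(f_i, f_j) > alpha and sim_s(s_i, s_j) > beta


def sim_p(vk_friends, fb_friends, first_freq, second_freq, n, alpha=0.8, beta=0.6):
    """Sum min(1, N / (freq(first) * freq(second))) over matched FB friends."""
    score = 0.0
    for fb_friend in fb_friends:
        if any(friends_similar(v, fb_friend, alpha, beta) for v in vk_friends):
            f_j, s_j = fb_friend
            # Assumed reading: the denominator is the product of name counts.
            expected = first_freq.get(f_j.lower(), 1) * second_freq.get(s_j.lower(), 1)
            score += min(1.0, n / expected)
    return score


vk_friends = [("Ivan", "Petrov"), ("Olga", "Sidorova")]
fb_friends = [("Ivan", "Petrov"), ("Maria", "Kuznetsova")]
common = sim_p(vk_friends, fb_friends, {"ivan": 1000}, {"petrov": 500}, n=100000)
rare = sim_p(vk_friends, fb_friends, {"ivan": 10}, {"petrov": 10}, n=100000)
```

With these toy frequencies, a shared friend with a common name adds only a fraction of a point, while a shared friend with a rare name adds the maximum of 1.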


Best candidate selection

FB candidates are ranked according to their similarity $sim_p$ to an input profile $p_{vk}$.

The best candidate $p_{fb}$ should pass two thresholds to match:

  • its score should be higher than the score threshold $\gamma$:

    $sim_p(p_{vk}, p_{fb}) > \gamma$

  • it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

    $\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$
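The two-threshold rule can be sketched as a short function. The defaults γ = 3 and δ = 5 are the values reported in the results of the talk; the function name, the input format, and the handling of a zero runner-up score are assumptions of this example.

```python
def select_best(scored, gamma=3.0, delta=5.0):
    """Pick the best FB candidate or return None if no candidate is reliable.

    scored: list of (fb_profile_id, sim_p_score) pairs for one VK profile.
    """
    if not scored:
        return None
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    # Threshold 1: the absolute score must exceed gamma.
    if best_score <= gamma:
        return None
    # Threshold 2: accept the only candidate, or require a clear margin.
    if len(ranked) == 1:
        return best_id
    runner_up = ranked[1][1]
    # Assumption: a zero runner-up score counts as an infinite ratio.
    if runner_up == 0 or best_score / runner_up > delta:
        return best_id
    return None


only = select_best([("fb_1", 10.0)])
clear = select_best([("fb_1", 10.0), ("fb_2", 1.0)])
ambiguous = select_best([("fb_1", 10.0), ("fb_2", 4.0)])
weak = select_best([("fb_1", 2.0)])
```

Returning no match for ambiguous or weak candidates trades recall for precision, which is consistent with the high-precision, moderate-recall figures reported for the method.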


Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α        0.8
Second name threshold β       0.6
Profile score threshold γ     3
Profile ratio threshold δ     5
Number of matched profiles    644334 (22%)
Expected precision            0.98
Expected recall               0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection
  • “Tell me who your friends are and I will tell you who you are”
  • Most frequent age/region among friends
  • Reject users with high variation of age/region among friends
  • Up to 85-90% precision

User Income Detection
  • Transfer learning: the target variable is not present in SNs
  • Train a model on a set of users with known income
  • Apply the model to the social network profiles
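The friend-based age/region idea can be illustrated with a majority vote plus a rejection rule. Using the share of the most frequent value as a proxy for "high variation", and the thresholds `min_share` and `min_friends`, are assumptions made for this sketch.

```python
from collections import Counter


def predict_from_friends(friend_values, min_share=0.5, min_friends=3):
    """Predict a user's attribute (age or region) from friends' values.

    Returns the most frequent known value, or None when there are too few
    friends with a known value or when the friends disagree too much.
    """
    known = [v for v in friend_values if v is not None]
    if len(known) < min_friends:
        return None
    value, count = Counter(known).most_common(1)[0]
    # Reject users whose friends are too heterogeneous.
    if count / len(known) < min_share:
        return None
    return value


region = predict_from_friends(["Kharkiv", "Kharkiv", "Kyiv", "Kharkiv", None])
scattered = predict_from_friends(["Kharkiv", "Kyiv", "Lviv", "Odesa"])
too_few = predict_from_friends([25, 25])
```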


Thank you! Questions?


Page 32: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data FB and VK groups

Text corpus

41 million of VK groups11 million of FB publics15 million of FB groups

Data format

Title andor descriptionList of membersNumber of comments likes posts by a member

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Data VK groups

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

253 interests detected by our system

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier

Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests

4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Association of group's interests with its users

Engagement of a person into an interest category is proportional to the activity of the person in groups of this category:

\[ e \approx w_{like} \cdot l + w_{scomm} \cdot c_s + w_{lcomm} \cdot c_l + w_{repost} \cdot r \]

  • $l$: the number of post likes
  • $c_s$: the number of short comments
  • $c_l$: the number of long comments
  • $r$: the number of reposts

Association score of a user and an interest depends on engagement in a group and on the number of groups:

\[ all \approx \alpha \cdot e_{fb} \cdot g_{fb} + \beta \cdot e_{vk} \cdot g_{vk} \]

  • $e_{vk}$, $e_{fb}$: engagement into a VK/FB interest
  • $g_{vk}$, $g_{fb}$: number of groups a user has in VK/FB
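The two scores above can be written out as a small worked example. The weights and the values of α and β below are hypothetical, since the slides do not state them; `Activity`, `engagement`, and `association` are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    likes: int = 0           # l: number of post likes
    short_comments: int = 0  # c_s: number of short comments
    long_comments: int = 0   # c_l: number of long comments
    reposts: int = 0         # r: number of reposts


# Hypothetical weights; the actual values are not given in the slides.
WEIGHTS = {"like": 1.0, "short_comment": 2.0, "long_comment": 3.0, "repost": 4.0}


def engagement(a, w=WEIGHTS):
    """e = w_like*l + w_scomm*c_s + w_lcomm*c_l + w_repost*r."""
    return (w["like"] * a.likes
            + w["short_comment"] * a.short_comments
            + w["long_comment"] * a.long_comments
            + w["repost"] * a.reposts)


def association(e_fb, g_fb, e_vk, g_vk, alpha=0.5, beta=0.5):
    """alpha * e_fb * g_fb + beta * e_vk * g_vk (alpha, beta assumed here)."""
    return alpha * e_fb * g_fb + beta * e_vk * g_vk


e_fb = engagement(Activity(likes=3, short_comments=1, reposts=1))  # 3 + 2 + 4
score = association(e_fb=e_fb, g_fb=2, e_vk=4.0, g_vk=3)           # 0.5*18 + 0.5*12
```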

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • Input: a user profile in one social network
  • Output: the profile of the same person in another social network
  • Immediate applications in marketing, search, security, etc.

Contribution

  • A user identity resolution approach
  • Precision of 0.98 and recall of 0.54
  • The method is computationally efficient and easily parallelizable

Dataset

                                  VK            Facebook
Number of users in our dataset    89,561,085    2,903,144
Number of users in Russia [1]     100,000,000   13,000,000
User overlap                      29            88

Training set: 92,488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Profile matching algorithm

1. Candidate generation. For each VK profile, we retrieve a set of FB profiles with similar first and second names.

2. Candidate ranking. The candidates are ranked according to similarity of their friends.

3. Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

  • Retrieve FB users with names similar to the input VK profile
  • Two names are similar if the first letters are the same and the edit distance between names is ≤ 2
  • Levenshtein automata for fuzzy match between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms: "Alexander", "Sasha", "Sanya", "Sanek", etc.
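The name-similarity rule above (same first letter, edit distance at most 2) can be illustrated with a short sketch. It uses the classic dynamic-programming edit distance rather than Levenshtein automata, so it shows the matching criterion but not the indexing that makes it fast at scale. The synonym dictionary contents and the way synonyms are applied (mapping to a canonical form before comparing) are assumptions for the example.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic program, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


# Hypothetical synonym dictionary mapping diminutives to a canonical form.
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(name):
    name = name.lower()
    return NAME_SYNONYMS.get(name, name)


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance <= max_dist, after canonicalization."""
    a, b = canonical(a), canonical(b)
    return bool(a) and bool(b) and a[0] == b[0] and levenshtein(a, b) <= max_dist


def generate_candidates(vk_profile, fb_profiles):
    """Profiles are (first_name, second_name) pairs; brute-force scan."""
    first, second = vk_profile
    return [p for p in fb_profiles
            if names_similar(first, p[0]) and names_similar(second, p[1])]


fb = [("Alexander", "Panchenko"), ("Alexandr", "Pancenko"), ("Ivan", "Panchenko")]
candidates = generate_candidates(("Sasha", "Panchenko"), fb)
```

A linear scan like this is only workable for toy data; matching one name against millions of FB names is where an automaton over a sorted dictionary or trie pays off.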

Candidate ranking

The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles. Two friends are considered to be similar if:

  • First two letters of their last names match
  • Similarity between first/last names $sim_s$ are greater than thresholds $\alpha$, $\beta$

\[ sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)} \]

Contribution of each friend to similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is inverse of name expectation frequency:

\[ sim_p(p_{vk}, p_{fb}) = \sum_{j:\; sim_s(s^f_i, s^f_j) > \alpha \,\wedge\, sim_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right) \]

Here $s^f_i$ and $s^s_i$ are first and second names of a VK profile, correspondingly, while $s^f_j$ and $s^s_j$ refer to a FB profile.
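A rough sketch of the ranking score follows. Several points are interpretations rather than facts from the slides: in the profile score, `|s|` is read as the frequency of a name in the dataset and `N` as the total number of profiles (so that `|s^f| * |s^s| / N` is the expected frequency of the full name); each FB friend is counted at most once; and all names are assumed to be lowercased already. The default thresholds 0.8 and 0.6 are the ones reported on the results slide.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic program, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(len(a), len(b))."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def friends_similar(vk_friend, fb_friend, alpha=0.8, beta=0.6):
    """Friends are (first_name, second_name) pairs."""
    (f_i, s_i), (f_j, s_j) = vk_friend, fb_friend
    return (s_i[:2] == s_j[:2]            # first two letters of last names match
            and sim_s(f_i, f_j) > alpha   # first names similar enough
            and sim_s(s_i, s_j) > beta)   # second names similar enough


def sim_p(vk_friends, fb_friends, first_freq, second_freq, n_profiles,
          alpha=0.8, beta=0.6):
    """Sum over matched FB friends of min(1, N / (freq_first * freq_second))."""
    score = 0.0
    for fb_friend in fb_friends:
        if any(friends_similar(v, fb_friend, alpha, beta) for v in vk_friends):
            f_j, s_j = fb_friend
            score += min(1.0, n_profiles / (first_freq[f_j] * second_freq[s_j]))
    return score


# Toy data: frequencies and friend lists are invented for the example.
first_freq = {"ivan": 100, "anna": 10, "john": 5}
second_freq = {"petrov": 50, "sidorova": 20, "smith": 5}
vk_friends = [("ivan", "petrov"), ("anna", "sidorova")]
fb_friends = [("ivan", "petrov"), ("anna", "sidorova"), ("john", "smith")]
score = sim_p(vk_friends, fb_friends, first_freq, second_freq, n_profiles=1000)
```

In the toy data a shared friend with a common name ("ivan petrov") contributes 0.2, while a shared friend with a rare name contributes the capped maximum of 1, which is the intended effect of weighting by inverse expected frequency.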

Best candidate selection

FB candidates are ranked according to similarity $sim_p$ to an input profile $p_{vk}$.

The best candidate $p_{fb}$ should pass two thresholds to match:

  • Its score should be higher than the score threshold $\gamma$:

\[ sim_p(p_{vk}, p_{fb}) > \gamma \]

  • It is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

\[ \frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta \]
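The two-threshold rule above is simple enough to sketch directly. The function name `select_best` and the handling of a zero runner-up score are assumptions; the defaults γ = 3 and δ = 5 are the values listed on the results slide.

```python
def select_best(candidates, gamma=3.0, delta=5.0):
    """Pick the matching FB profile, or None if no candidate is convincing.

    candidates: list of (fb_profile_id, sim_p_score) pairs for one VK profile.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score <= gamma:        # fails the score threshold
        return None
    if len(ranked) == 1:           # the only candidate
        return best_id
    runner_up = ranked[1][1]
    # Assumption: a runner-up score of zero makes the ratio unbounded, so accept.
    if runner_up == 0 or best_score / runner_up > delta:
        return best_id
    return None
```

For example, `select_best([("a", 12.0), ("b", 2.0)])` accepts `"a"` (score 12 > 3, ratio 6 > 5), while `select_best([("a", 12.0), ("b", 4.0)])` rejects both because the ratio of 3 is too close to call.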

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Results: matching VK and FB profiles

First name threshold α        0.8
Second name threshold β       0.6
Profile score threshold γ     3
Profile ratio threshold δ     5
Number of matched profiles    644,334 (22%)
Expected precision            0.98
Expected recall               0.54

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection

  • "Tell me who your friends are, and I will tell you who you are"
  • Most frequent age/region of friends
  • Reject users with high variation of age/region among friends
  • Up to 85-90% precision

User Income Detection

  • Transfer learning: the target variable is not present in SNs
  • Training a model on a set of users with known income
  • Applying the model to the social network profiles
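The friend-based age/region idea above amounts to a majority vote with a rejection rule. A minimal sketch, assuming that "high variation" is approximated by the share of friends agreeing with the most frequent value; the `min_share` cut-off of 0.5 is hypothetical and not from the slides:

```python
from collections import Counter


def infer_from_friends(friend_values, min_share=0.5):
    """Return the most frequent value among friends, or None if rejected.

    friend_values: region names or ages of a user's friends (None = unknown).
    The user is rejected when fewer than min_share of the friends with a known
    value agree with the most frequent one.
    """
    values = [v for v in friend_values if v is not None]
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    if count / len(values) < min_share:
        return None
    return value
```

For ages, a numeric spread such as the standard deviation would be a natural alternative rejection criterion.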

Thank you! Questions?

  • Social Network Data
  • Social Network Analysis
  • User Gender Detection
  • User Language Detection
  • User Interests Detection
  • VK-FB User Matching
  • Other SNA Tasks
Page 35: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Method

1 Create a text index of groups2 Create a keyword list for each of 253 interests3 KW classifier

Retrieve top k groups retrieved by a set interest keywordsRank by TF-IDFAssociate grouprsquos interests with its usersA group may have multiple interests

4 ML classifierUse top k groups as a training dataBOW featuresKeyword featuresLinear models L2 LR Liner SVM NBClassify all groupsA group may have up to three top interestsAssociate grouprsquos interests with its users

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Association of grouprsquos interests with its users

Engagement of a person into an interest category isproportional to the activity of the person in groups of this category

e asymp wlike middot l + wscomm middot cs + wl comm middot cl + wrepost middot r

l ndash the number of post likescs ndash the number of short commentscl ndash the number of long commentsr ndash the number of reposts

Association score of a user and an interest depends onengagement in a group and on the number of groups

all asymp α middot efb middot gfb + β middot evk middot gvk

evk efb ndash engagement into VKFB interestgvk gfb ndash number of groups a user has in FBVK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Results per category: the best and the worst

Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

6 VK-FB User Matching

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • input: a user profile in one social network
  • output: the profile of the same person in another social network
  • immediate applications in marketing, search, security, etc.

Contribution

  • a user identity resolution approach
  • precision of 0.98 and recall of 0.54
  • the method is computationally efficient and easily parallelizable


Dataset

                                  VK             Facebook
Number of users in our dataset    89,561,085     2,903,144
Number of users in Russia¹        100,000,000    13,000,000
User overlap                      29             88

Training set: 92,488 matched FB-VK profiles

¹ According to comScore and http://vk.com/about


Profile matching algorithm

1 Candidate generation. For each VK profile, we retrieve a set of FB profiles with similar first and second names.

2 Candidate ranking. The candidates are ranked according to the similarity of their friends.

3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.


Candidate generation

  • Retrieve FB users with names similar to those of the input VK profile
  • Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein automata are used for fuzzy matching between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms: "Alexander", "Sasha", "Sanya", "Sanek", etc.
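The similarity rule above (same first letter, edit distance at most 2, plus name synonyms) can be sketched as below. Note the substitution: the talk mentions Levenshtein automata for matching against all FB names at scale, whereas this example uses a plain dynamic-programming edit distance with a linear scan, which is only practical for small inputs. The synonym dictionary, the choice to apply it to first names only, and the profile tuples are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def names_similar(a, b, max_dist=2):
    """Same first letter and edit distance <= max_dist."""
    a, b = a.lower(), b.lower()
    return bool(a) and bool(b) and a[0] == b[0] and levenshtein(a, b) <= max_dist


# Hypothetical synonym dictionary mapping short forms to a canonical name.
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(name):
    name = name.lower()
    return NAME_SYNONYMS.get(name, name)


def generate_candidates(vk_profile, fb_profiles):
    """Profiles are (first_name, second_name) tuples; returns matching FB profiles."""
    vk_first, vk_second = vk_profile
    return [
        (first, second)
        for first, second in fb_profiles
        if names_similar(canonical(vk_first), canonical(first))
        and names_similar(vk_second, second)
    ]


fb_profiles = [
    ("Alexander", "Panchenko"),
    ("Aleksander", "Panchenko"),
    ("Sasha", "Panchenka"),
    ("Dmitry", "Panchenko"),
    ("Alexander", "Ivanov"),
]
candidates = generate_candidates(("Sasha", "Panchenko"), fb_profiles)
```

Here the first three FB profiles are kept as candidates, while the last two are rejected because the first letter of the first or second name differs.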


Candidate ranking

The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles. Two friends are considered to be similar if:

  • the first two letters of their last names match;
  • the similarities $sim_s$ between their first names and between their last names are greater than the thresholds $\alpha$ and $\beta$:

$$sim_s(s_i, s_j) = 1 - \frac{lev(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

The contribution of each friend to the similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected frequency of the friend's name:

$$sim_p(p_{vk}, p_{fb}) = \sum_{j \,:\, sim_s(s^f_i, s^f_j) > \alpha \,\wedge\, sim_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$$

Here $s^f_i$ and $s^s_i$ are the first and second names of a VK profile, respectively, while $s^f_j$ and $s^s_j$ refer to an FB profile.
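A possible reading of the ranking formulas above is sketched here. Several points are assumptions rather than facts from the slides: the terms in the denominator of the profile similarity are interpreted as the numbers of profiles carrying the given first and second name, N is treated as a free normalizing constant, the sum runs over FB friends that have at least one similar VK friend, and all names and frequencies are invented toy values.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def friends_similar(vk_friend, fb_friend, alpha=0.8, beta=0.6):
    """First two letters of last names match and both similarities pass."""
    (vk_first, vk_second), (fb_first, fb_second) = vk_friend, fb_friend
    return (vk_second[:2].lower() == fb_second[:2].lower()
            and sim_s(vk_first.lower(), fb_first.lower()) > alpha
            and sim_s(vk_second.lower(), fb_second.lower()) > beta)


def sim_p(vk_friends, fb_friends, first_freq, second_freq, n, alpha=0.8, beta=0.6):
    """Sum min(1, n / (freq(first) * freq(second))) over matched FB friends."""
    score = 0.0
    for fb_friend in fb_friends:
        if any(friends_similar(v, fb_friend, alpha, beta) for v in vk_friends):
            first, second = fb_friend
            score += min(1.0, n / (first_freq[first] * second_freq[second]))
    return score


# Hypothetical toy data: name frequencies and friend lists.
first_freq = {"Ivan": 500, "Agafia": 2, "Petr": 300}
second_freq = {"Ivanov": 400, "Zvyagintseva": 1, "Sidorov": 200}
vk_friends = [("Ivan", "Ivanov"), ("Agafya", "Zvyagintseva")]
fb_friends = [("Ivan", "Ivanov"), ("Agafia", "Zvyagintseva"), ("Petr", "Sidorov")]

score = sim_p(vk_friends, fb_friends, first_freq, second_freq, n=1000)
```

With these toy numbers the common name pair contributes only 0.005, while the rare name pair contributes the capped value 1, so shared friends with rare names dominate the profile similarity.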


Best candidate selection

FB candidates are ranked according to their similarity $sim_p$ to the input profile $p_{vk}$.

The best candidate $p_{fb}$ should pass two thresholds to match:

  • its score should be higher than the score threshold $\gamma$:

$$sim_p(p_{vk}, p_{fb}) > \gamma$$

  • it should be either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$$\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$$
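The two-threshold selection rule above can be written as a short function. This is a sketch under stated assumptions: the candidate list format, the profile identifiers, and the handling of a zero runner-up score are choices made for the example, and the default values of γ and δ are only illustrative.

```python
def select_best(scored, gamma=3.0, delta=5.0):
    """Pick the best FB candidate or return None.

    scored: list of (fb_profile_id, sim_p score) pairs for one VK profile.
    """
    if not scored:
        return None
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    best_id, best_score = ranked[0]
    # Threshold 1: the absolute score must exceed gamma.
    if best_score <= gamma:
        return None
    # Threshold 2: either the only candidate ...
    if len(ranked) == 1:
        return best_id
    # ... or the ratio to the runner-up must exceed delta.
    runner_up = ranked[1][1]
    # Assumption: a zero runner-up score counts as an unbounded ratio.
    if runner_up == 0 or best_score / runner_up > delta:
        return best_id
    return None
```

For example, `select_best([("a", 12.0), ("b", 2.0)])` accepts `"a"` because the ratio is 6, while `select_best([("a", 12.0), ("b", 4.0)])` rejects the match because the ratio is only 3. Rejecting ambiguous cases in this way trades recall for precision.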


Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.


Results: matching VK and FB profiles

First name threshold α        0.8
Second name threshold β       0.6
Profile score threshold γ     3
Profile ratio threshold δ     5
Number of matched profiles    644,334 (22%)
Expected precision            0.98
Expected recall               0.54
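As a quick sanity check on the figures above, and assuming the parenthesized 22 is a percentage of the 2,903,144 FB users in the dataset (the slide does not state the denominator), the share of matched profiles and the F1 score implied by the reported precision and recall work out as follows. The F1 value is derived here for illustration and is not reported in the source.

```latex
\[
\frac{644{,}334}{2{,}903{,}144} \approx 0.222,
\qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2 \cdot 0.98 \cdot 0.54}{0.98 + 0.54}
    = \frac{1.0584}{1.52} \approx 0.70
\]
```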

7 Other SNA Tasks

Much more fun stuff can be done with the FB/VK data

User Age & Region Detection
  • "Tell me who your friends are and I will tell you who you are"
  • Use the most frequent age/region among a user's friends
  • Reject users with high variation of age/region among their friends
  • Precision of up to 85-90%

User Income Detection
  • Transfer learning: the target variable is not present in social networks
  • Train a model on a set of users with known income
  • Apply the model to the social network profiles
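The friend-based age detection idea above (take the most frequent age among friends, reject users whose friends vary too much) can be sketched as below. The standard-deviation cutoff and the minimum number of friends are hypothetical parameters; the slides only state that users with high variation are rejected.

```python
from collections import Counter
from statistics import pstdev


def predict_age(friend_ages, max_std=5.0, min_friends=3):
    """Return the most frequent friend age, or None if the user is rejected.

    max_std and min_friends are assumed thresholds, not values from the talk.
    """
    if len(friend_ages) < min_friends:
        return None  # too little evidence
    if pstdev(friend_ages) > max_std:
        return None  # friends' ages vary too much
    return Counter(friend_ages).most_common(1)[0][0]
```

For instance, `predict_age([24, 25, 25, 26, 25])` yields 25, while a user whose friends are aged 18, 45, 30, 62 and 25 is rejected. The same pattern would apply to regions by replacing the standard deviation with a measure suited to categorical values, such as the share of the most frequent region.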


Thank you! Questions?



  • Social Network Data
  • Social Network Analysis
  • User Gender Detection
  • User Language Detection
  • User Interests Detection
  • VK-FB User Matching
  • Other SNA Tasks
Page 38: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results per category the best and the worst

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Intersection of the top 30 interests on FB and VK

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Interests co-occurrences

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Segei Objedkov

Motivation

input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc

Contribution

user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Dataset

VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88

training set 92488 matched FB-VK profiles

1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Profile matching algorithm

1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names

2 Candidate ranking The candidates are ranked according tosimilarity of their friends

3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate generation

Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms

ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate ranking

The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if

First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β

sims(si sj) = 1minus lev(si sj)max(|si | |sj |)

Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency

simp(pvk pfb) =sum

j sims(sfi s

fj )gtαandsims(ss

i ssj )gtβ

min(1N

|s fj | middot |ss

j |)

Here s fi and ss

i are first and second names of a VK profilecorrespondingly while s f

j and ssj refer to a FB profile

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Best candidate selection

FB candidates are ranked according to similarity simp to aninput profile pvk

The best candidate pfb should pass two thresholds to match

its score should be higher than the score threshold γ

simp(pvk pfb) gt γ

either the only candidate or score ratio between it and the nextbest candidate pprime

fb should be higher than the ratio threshold δ

simp(pvk pfb)

simp(pvk pprimefb)

gt δ

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results matching VK and FB profiles

First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Much more fun stuff can be done with the FBVK data

User Age amp Region DetectionTell me who are your friends and I will say who you areMost frequent ageregion of friendsReject users with high variation of ageregion among friendsUp to 85-90 of precision

User Income DetectionTransfer learning target variable is not present in SNsTraining a model on a set of users with known incomeApplying the model on the social network profiles

Thank you! Questions?


Top 30 interests on FB and VK

Intersection of the top 30 interests on FB and VK

Interests co-occurrences

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

input: a user profile in one social network
output: the profile of the same person in another social network
immediate applications in marketing, search, security, etc.

Contribution

a user identity resolution approach
precision of 0.98 and recall of 0.54
the method is computationally efficient and easily parallelizable

Dataset

                                 VK            Facebook
Number of users in our dataset   89,561,085    2,903,144
Number of users in Russia [1]    100,000,000   13,000,000
User overlap                     29            88

Training set: 92,488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Profile matching algorithm

1. Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.

2. Candidate ranking. The candidates are ranked according to the similarity of their friends.

3. Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Candidate generation

Retrieve FB users with names similar to the input VK profile
Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
Levenshtein automata for fuzzy matching between a VK user name and all FB user names
Automatically extracted dictionary of name synonyms: "Alexander", "Sasha", "Sanya", "Sanek", etc.
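The name-similarity test above can be illustrated with a small Python sketch. Note the swap: it uses a plain dynamic-programming edit distance and a linear scan rather than the Levenshtein automata mentioned on the slide, so it shows the matching criterion, not the scalable lookup. The function names and the shape of the `synonyms` dictionary (variant mapped to a canonical form) are assumptions, and the sketch handles a single name string rather than separate first and second names.

```python
from typing import Dict, List, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def names_similar(a: str, b: str, max_distance: int = 2) -> bool:
    """Same first letter and edit distance of at most `max_distance`."""
    a, b = a.lower(), b.lower()
    if not a or not b or a[0] != b[0]:
        return False
    return levenshtein(a, b) <= max_distance


def generate_candidates(vk_name: str,
                        fb_names: List[str],
                        synonyms: Optional[Dict[str, str]] = None) -> List[str]:
    """Return FB names similar to `vk_name`, after synonym normalization."""
    synonyms = synonyms or {}

    def canon(name: str) -> str:
        # Hypothetical dictionary layout: lowercase variant -> canonical name.
        return synonyms.get(name.lower(), name.lower())

    target = canon(vk_name)
    return [name for name in fb_names if names_similar(target, canon(name))]
```

With a synonym dictionary that maps "sasha" and "sanek" to "alexander", a query for "Sasha" would retrieve both "Alexander" and "Sanek" from the FB name list.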

Candidate ranking

The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles.

Two friends are considered to be similar if:

the first two letters of their last names match;
the similarities $\mathrm{sim}_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$, where

$$\mathrm{sim}_s(s_i, s_j) = 1 - \frac{\mathrm{lev}(s_i, s_j)}{\max(|s_i|, |s_j|)}$$

The contribution of each friend to the similarity $\mathrm{sim}_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the name expectation frequency:

$$\mathrm{sim}_p(p_{vk}, p_{fb}) = \sum_{j \,:\, \mathrm{sim}_s(s_i^f, s_j^f) > \alpha \,\wedge\, \mathrm{sim}_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)$$

Here $s_i^f$ and $s_i^s$ are the first and second names of a VK profile, respectively, while $s_j^f$ and $s_j^s$ refer to a FB profile.
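The ranking formulas above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: all function names are hypothetical, the per-friend weight is passed in as a callable, and `capped_weight` reads $|s_j^f|$ and $|s_j^s|$ as name frequency counts, which is an assumption not stated on the slide. The nested loop also counts every similar pair, which is a simplification.

```python
from typing import Callable, List, Tuple

Name = Tuple[str, str]  # (first name, second name)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def sim_s(a: str, b: str) -> float:
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def friends_similar(vk: Name, fb: Name, alpha: float, beta: float) -> bool:
    """First two letters of the last names match and both sim_s pass."""
    return (vk[1][:2] == fb[1][:2]
            and sim_s(vk[0], fb[0]) > alpha
            and sim_s(vk[1], fb[1]) > beta)


def capped_weight(n: float, first_count: int, second_count: int) -> float:
    """min(1, N / (count_first * count_second)); counts are an assumption."""
    return min(1.0, n / (first_count * second_count))


def sim_p(vk_friends: List[Name],
          fb_friends: List[Name],
          weight: Callable[[Name], float],
          alpha: float = 0.8,
          beta: float = 0.6) -> float:
    """Sum the weights of FB friends that have a similar VK friend."""
    score = 0.0
    for vk in vk_friends:
        for fb in fb_friends:
            if friends_similar(vk, fb, alpha, beta):
                score += weight(fb)
    return score
```

Passing `weight=lambda fb: 1.0` reduces the score to a plain count of similar friend pairs, which is a convenient baseline before plugging in a rarity-based weight.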

Page 42: Text Analysis of Social Networks: Working with FB and VK Data

Data SNA User Gender User Language User Interests User Matching Other tasks References

Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Problem

Joint work with Dmitry Babaev and Segei Objedkov

Motivation

input a user profile of one social networkoutput profile of the same person in another social networkimmediate applications in marketing search security etc

Contribution

user identity resolution approachprecision of 098 and recall of 054the method is computationally effective and easily parallelizable

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Dataset

VK FacebookNumber of users in our dataset 89561085 2903144Number of users in Russia 1 100000000 13000000User overlap 29 88

training set 92488 matched FB-VK profiles

1According to comScore and httpvkcomaboutAlexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Profile matching algorithm

1 Candidate generation For each VK profile we retrieve a setof FB profiles with similar first and second names

2 Candidate ranking The candidates are ranked according tosimilarity of their friends

3 Selection of the best candidate The goal of the final stepis to select the best match from the list of candidates

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate generation

Retrieve FB users with names similar to the input VK profileTwo names are similar if the first letters are the same and theedit distance between names le 2Levenshtein Automata for fuzzy match between a VK username and all FB user namesAutomatically extracted dictionary of name synonyms

ldquoAlexanderrdquo ldquoSashardquo ldquoSanyardquo ldquoSanekrdquo etc

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Candidate ranking

The higher the number of friends with similar names in VKand FB profiles the greater the similarity of these profilesTwo friends are considered to be similar if

First two letters of their last names matchSimilarity between firstlast names sims are greater thanthresholds α β

sims(si sj) = 1minus lev(si sj)max(|si | |sj |)

Contribution of each friend to similarity simp of two profilespvk and pfb is inverse of name expectation frequency

simp(pvk pfb) =sum

j sims(sfi s

fj )gtαandsims(ss

i ssj )gtβ

min(1N

|s fj | middot |ss

j |)

Here s fi and ss

i are first and second names of a VK profilecorrespondingly while s f

j and ssj refer to a FB profile

Page 43: Text Analysis of Social Networks: Working with FB and VK Data

Problem

Joint work with Dmitry Babaev and Sergei Objedkov

Motivation

  • Input: a user profile in one social network
  • Output: the profile of the same person in another social network
  • Immediate applications in marketing, search, security, etc.

Contribution

  • A user identity resolution approach
  • Precision of 0.98 and recall of 0.54
  • The method is computationally efficient and easily parallelizable

Page 44: Text Analysis of Social Networks: Working with FB and VK Data

Dataset

                                  VK            Facebook
Number of users in our dataset    89,561,085    2,903,144
Number of users in Russia [1]     100,000,000   13,000,000
User overlap                      29            88

Training set: 92,488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Page 45: Text Analysis of Social Networks: Working with FB and VK Data

Profile matching algorithm

1. Candidate generation: for each VK profile, we retrieve a set of FB profiles with similar first and second names.

2. Candidate ranking: the candidates are ranked according to the similarity of their friends.

3. Selection of the best candidate: the goal of the final step is to select the best match from the list of candidates.
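
The three steps can be read as a small pipeline. The Python sketch below is illustrative only and is not the authors' code: `match_profile`, `select_best`, and the callables `generate` and `score` are hypothetical names. The selection rule (score above a threshold γ, and either a single candidate or a score ratio to the runner-up above δ) follows the deck's "Best candidate selection" slide; treating a zero runner-up score as an unbounded ratio is an assumption.

```python
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

Profile = Hashable  # placeholder type: any profile identifier or record


def select_best(scored: List[Tuple[Profile, float]],
                gamma: float, delta: float) -> Optional[Profile]:
    """Step 3: return the top candidate only if it passes both thresholds."""
    if not scored:
        return None
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    best, best_score = ranked[0]
    if best_score <= gamma:  # score threshold
        return None
    if len(ranked) > 1:
        runner_up_score = ranked[1][1]
        # Ratio threshold; a zero runner-up score counts as an unbounded ratio
        # (an assumption, the slides do not cover this case).
        if runner_up_score > 0 and best_score / runner_up_score <= delta:
            return None
    return best


def match_profile(vk_profile: Profile,
                  generate: Callable[[Profile], Iterable[Profile]],
                  score: Callable[[Profile, Profile], float],
                  gamma: float, delta: float) -> Optional[Profile]:
    """Run the three steps for a single VK profile."""
    candidates = generate(vk_profile)                             # step 1
    scored = [(fb, score(vk_profile, fb)) for fb in candidates]   # step 2
    return select_best(scored, gamma, delta)                      # step 3
```

Each VK profile is processed independently here, so a function of this shape can be mapped over profiles in parallel, which is in line with the claim that the method is easily parallelizable.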

Page 46: Text Analysis of Social Networks: Working with FB and VK Data

Candidate generation

  • Retrieve FB users with names similar to those of the input VK profile
  • Two names are similar if their first letters are the same and the edit distance between the names is ≤ 2
  • Levenshtein automata are used for fuzzy matching between a VK user name and all FB user names
  • Automatically extracted dictionary of name synonyms: "Alexander", "Sasha", "Sanya", "Sanek", etc.
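
The name-similarity rule can be illustrated with a brute-force sketch. Note the substitution: the slides use Levenshtein automata to search all FB names efficiently, whereas the code below simply computes a dynamic-programming edit distance against every FB profile, which is only practical for small inputs. The `(first, second)` tuple representation, the function names, and the tiny `NAME_SYNONYMS` dictionary are hypothetical; applying synonyms to first names only is also an assumption.

```python
from typing import Dict, List, Tuple

Name = Tuple[str, str]  # (first name, second name); assumed representation


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def names_similar(a: str, b: str, max_dist: int = 2) -> bool:
    """Same first letter and edit distance of at most max_dist."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_dist


# Hypothetical stand-in for the automatically extracted synonym dictionary.
NAME_SYNONYMS: Dict[str, str] = {
    "sasha": "alexander", "sanya": "alexander", "sanek": "alexander",
}


def canonical(first_name: str) -> str:
    """Map a first-name variant to its canonical form, if one is known."""
    key = first_name.lower()
    return NAME_SYNONYMS.get(key, key)


def generate_candidates(vk: Name, fb_profiles: List[Name]) -> List[Name]:
    """Return FB profiles whose first and second names are both similar."""
    vk_first, vk_second = vk
    return [fb for fb in fb_profiles
            if names_similar(canonical(vk_first), canonical(fb[0]))
            and names_similar(vk_second, fb[1])]
```

For example, `generate_candidates(("Sasha", "Panchenko"), fb_profiles)` would keep an FB profile named `("Alexander", "Panchenko")` because the synonym lookup maps "Sasha" to "alexander" before the comparison.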

Page 47: Text Analysis of Social Networks: Working with FB and VK Data

Candidate ranking

  • The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles
  • Two friends are considered to be similar if:
      • the first two letters of their last names match
      • the similarities $\mathrm{sim}_s$ between their first names and between their last names are greater than the thresholds $\alpha$ and $\beta$, respectively

$$\mathrm{sim}_s(s_i, s_j) = 1 - \frac{\mathrm{lev}(s_i, s_j)}{\max(|s_i|, |s_j|)}$$
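
A minimal sketch of the string similarity and the friend-pair test follows. The function names and the `(first, last)` tuple representation are hypothetical. The default thresholds 0.8 and 0.6 are read from the deck's results slide, where the extracted text shows "08" and "06", so treat them as assumed values. Lower-casing names before comparison is also an assumption.

```python
from typing import Tuple

Name = Tuple[str, str]  # (first name, last name); assumed representation


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """sim_s = 1 - lev(a, b) / max(|a|, |b|), where |x| is string length."""
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def friends_similar(vk_friend: Name, fb_friend: Name,
                    alpha: float = 0.8, beta: float = 0.6) -> bool:
    """Apply the two conditions from the slide to one pair of friends."""
    vk_first, vk_last = (s.lower() for s in vk_friend)
    fb_first, fb_last = (s.lower() for s in fb_friend)
    if vk_last[:2] != fb_last[:2]:  # first two letters of last names
        return False
    return (name_similarity(vk_first, fb_first) > alpha
            and name_similarity(vk_last, fb_last) > beta)
```

The two-letter prefix check is cheap, so running it before the edit-distance computations avoids most of the expensive comparisons.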

  • The contribution of each friend to the similarity $\mathrm{sim}_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected name frequency:

$$\mathrm{sim}_p(p_{vk}, p_{fb}) = \sum_{j \,:\, \mathrm{sim}_s(s^f_i, s^f_j) > \alpha \;\wedge\; \mathrm{sim}_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right)$$

Here $s^f_i$ and $s^s_i$ are the first and second names of a VK profile, respectively, while $s^f_j$ and $s^s_j$ refer to a FB profile.
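
The profile score can be sketched as follows, under several assumptions the slide does not spell out: $|s^f_j|$ and $|s^s_j|$ are read as the number of profiles carrying that first or second name, $N$ as the total number of profiles, and each FB friend $j$ is counted once if it is similar to at least one VK friend $i$. Under this reading, $|s^f_j| \cdot |s^s_j| / N$ is the expected number of profiles with that full name, so rare names contribute more. All names in the code are hypothetical, and the pairwise test is passed in as `is_similar`.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Name = Tuple[str, str]  # (first name, second name); assumed representation


def profile_similarity(vk_friends: List[Name],
                       fb_friends: Iterable[Name],
                       first_freq: Dict[str, int],
                       second_freq: Dict[str, int],
                       n_profiles: int,
                       is_similar: Callable[[Name, Name], bool]) -> float:
    """Sum capped inverse expected name frequencies over matched FB friends."""
    score = 0.0
    for fb in fb_friends:
        # Count an FB friend once if any VK friend has a similar name.
        if any(is_similar(vk, fb) for vk in vk_friends):
            first, second = fb
            # Inverse of the expected frequency of this full name, capped at 1.
            score += min(1.0, n_profiles / (first_freq[first] * second_freq[second]))
    return score
```

With 10,000 profiles, a shared friend whose first name occurs 1,000 times and whose second name occurs 500 times contributes only 0.02, while a shared friend with a rare name contributes the full 1.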

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Best candidate selection

FB candidates are ranked according to similarity simp to aninput profile pvk

The best candidate pfb should pass two thresholds to match

its score should be higher than the score threshold γ

simp(pvk pfb) gt γ

either the only candidate or score ratio between it and the nextbest candidate pprime

fb should be higher than the ratio threshold δ

simp(pvk pfb)

simp(pvk pprimefb)

gt δ

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results

Figure Precision-recall plot of the matching method The bold linedenotes the best precision at given recall

Alexander Panchenko Text Analysis of Social Networks

Data SNA User Gender User Language User Interests User Matching Other tasks References

Results matching VK and FB profiles

First name threshold α 08Second name threshold β 06Profile score threshold γ 3Profile ratio threshold δ 5Number of matched profiles 644334 (22)Expected precision 098Expected recall 054


Outline

1 Social Network Data

2 Social Network Analysis

3 User Gender Detection

4 User Language Detection

5 User Interests Detection

6 VK-FB User Matching

7 Other SNA Tasks


Much more fun stuff can be done with the FB/VK data

User Age & Region Detection
  • "Tell me who your friends are and I will tell you who you are"
  • Most frequent age/region of friends
  • Reject users with high variation of age/region among friends
  • Up to 85-90% precision

User Income Detection
  • Transfer learning: the target variable is not present in SNs
  • Training a model on a set of users with known income
  • Applying the model to the social network profiles
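The friend-based age/region heuristic listed above can be sketched as a majority vote with a rejection rule. This is only an illustration: the slides do not say how "high variation" is measured, so the share of the most frequent value is used here as a stand-in, and the parameters `min_share` and `min_friends` are hypothetical.

```python
from collections import Counter


def infer_from_friends(values, min_share=0.5, min_friends=3):
    """Guess a user's age or region from the values of their friends.

    values holds one entry per friend (None when unknown). Returns the most
    frequent value, or None when the friends disagree too much.
    """
    known = [v for v in values if v is not None]
    if len(known) < min_friends:       # too little evidence to decide
        return None
    value, count = Counter(known).most_common(1)[0]
    if count / len(known) < min_share: # high variation among friends: reject
        return None
    return value
```

Raising `min_share` rejects more users and should push precision up at the cost of coverage, which is the trade-off the rejection step on the slide is aiming at.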


Thank you! Questions?


References

Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In ICWSM.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., pages 18–21.

Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers' age and gender. In Third International AAAI Conference on Weblogs and Social Media.

Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Workshop.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics.

Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.

Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 37–44. ACM.
