1
Learning Social Networks from Web Documents Using Support Vector Classifiers
IEEE, WI’06
Masoud Makrehchi & Mohamed S. Kamel
Presenter: Teng-Kai Fan
Date: 2008-11-18
2
Abstract
Learning a social network from incomplete relationship data.
Translating social network extraction into a text classification problem.
SVM (Support Vector Machine)
FOAF (Friend Of A Friend) dataset & F-measure.
3
Outline
Introduction
Related Work
Problem Statement
Proposed approach: Learning Social Network from Incomplete Network
Experiment
Conclusion
4
Introduction
A social network is defined as a map of the relationships (ties) between individuals (actors).
Applications: Marketing, Advertising. Finding friends.
5
Introduction cont.
In this study, the authors propose an approach to generate a social network from a collection of web documents.
Actor-term matrix: every person can be represented by his or her corresponding documents.
Learning social relations from the actor-term database.
Assumption: the social network is partially explored (training dataset).
A support vector classifier is employed to extract the missing relations and complete the social network.
6
Related Work
Social network models can be constructed either directly or indirectly.
Direct (descriptive): acquaintanceship is extracted from explicit information, e.g., e-mail, cited papers, relational databases, web page links, etc.
Indirect (predictive): acquaintanceship is translated into the similarity of two actors, e.g., papers, opinions, news, etc.
7
Problem Statement
The goal is to predict and learn the network while knowing only a small number of relations between individual persons.
Social networks are represented either by graphs or by adjacency matrices.
Incomplete matrix (training examples)
Complete matrix (learned matrices)
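As a rough illustration of the incomplete/complete matrix idea, the sketch below splits a partially known adjacency matrix into labeled pairs (training examples) and unknown pairs (to be predicted). The 4-actor matrix, the use of `None` for an unobserved relation, and the helper name `split_pairs` are all assumptions made for this example, not details taken from the slides.

```python
from itertools import combinations

# Hypothetical 4-actor network (illustrative data only).
# 1 = known tie, 0 = known absence of a tie, None = not yet observed.
adjacency = [
    [None, 1,    0,    None],
    [1,    None, None, 1],
    [0,    None, None, None],
    [None, 1,    None, None],
]

def split_pairs(adj):
    """Split the upper triangle into labeled pairs and unknown pairs."""
    labeled, unknown = [], []
    for i, j in combinations(range(len(adj)), 2):
        if adj[i][j] is None:
            unknown.append((i, j))              # relation to be predicted
        else:
            labeled.append(((i, j), adj[i][j]))  # training example
    return labeled, unknown

labeled, unknown = split_pairs(adjacency)
print("training examples:", labeled)
print("pairs to predict:", unknown)
```

Only the upper triangle is visited because the ties are treated as undirected here, which is also an assumption.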
8
Learning Social Network from Incomplete Network
Two assumptions:
A subset of the relations is known and represented by an adjacency matrix.
Textual data associated with the actors is available.
Three steps:
Modeling the actors in the social network.
Modeling the relations between the actors.
Training a classifier to learn the social network.
9
Actor Modeling
Each actor is represented by his or her web documents, including home page, blog, CV and so on.
All documents associated with an individual are merged together to build a unique document vector. Each document is associated with one actor.
where the weighting scheme is tf*idf.
10
Actor Modeling cont.
Consequently, the corpus is modeled by a matrix called the Actor-Term Matrix.
Dimension reduction:
Stemming and stop-word list.
DF (document frequency): terms with DF less than 5 or more than 100 were removed.
[Figure: Actor-Term Matrix with axes Actor and Term and tf*idf entries]
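A minimal sketch of how an actor-term matrix with tf*idf weights and DF-based term filtering could be built is given below. The idf variant `log(n / df)`, the whitespace tokenizer, the toy documents, and the function name `actor_term_matrix` are assumptions for illustration; stemming and stop-word removal are omitted, and the toy thresholds are much smaller than the 5 and 100 used in the slides.

```python
import math
from collections import Counter

def actor_term_matrix(actor_docs, min_df=1, max_df=None):
    """Build a tf*idf actor-term matrix from {actor: merged document text}."""
    actors = sorted(actor_docs)
    # Term frequencies per actor (naive lowercase whitespace tokenization).
    counts = {a: Counter(actor_docs[a].lower().split()) for a in actors}
    # Document frequency: number of actors whose text contains the term.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    n = len(actors)
    max_df = n if max_df is None else max_df
    # Keep only terms whose DF lies inside [min_df, max_df].
    terms = sorted(t for t, f in df.items() if min_df <= f <= max_df)
    matrix = [[counts[a][t] * math.log(n / df[t]) for t in terms]
              for a in actors]
    return actors, terms, matrix

# Hypothetical merged documents, one per actor.
docs = {
    "alice": "svm kernel svm",
    "bob": "kernel graph",
    "carol": "graph social graph",
}
actors, terms, matrix = actor_term_matrix(docs, min_df=2)
for actor, row in zip(actors, matrix):
    print(actor, dict(zip(terms, row)))
```

With `min_df=2`, the terms that occur for only one actor are dropped, which mirrors the low-DF cut described above on a toy scale.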
11
Relationship Modeling
One simple approach to modeling the relation between two actors is to estimate the similarity of their document vectors.
A similarity measure (e.g., cosine, Jaccard or correlation) offers very poor results because it models each relation with only one variable.
A better approach is to aggregate the document vectors of the actors on both sides of the relation and create a new aggregated document vector.
12
Relationship Modeling cont.
Let di and dj be the document vectors associated with actors ai and aj.
The relation between two actors is modeled by aggregating their vectors with an operator such as MIN, MAX, or Product.
The aggregated document vector (relation vector) is obtained by applying the chosen operator to di and dj.
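The aggregation step might look like the sketch below, under the assumption that MIN, MAX, and Product are applied element-wise to two document vectors of equal length. The function name `relation_vector` and the example vectors are hypothetical.

```python
import operator

# Element-wise aggregation operators (assumed interpretation of MIN, MAX, Product).
AGGREGATORS = {"min": min, "max": max, "product": operator.mul}

def relation_vector(d_i, d_j, op="min"):
    """Aggregate two document vectors into a single relation vector."""
    if len(d_i) != len(d_j):
        raise ValueError("document vectors must have the same length")
    combine = AGGREGATORS[op]
    return [combine(x, y) for x, y in zip(d_i, d_j)]

# Hypothetical tf*idf vectors for two actors over three terms.
d_i = [0.0, 1.0, 2.0]
d_j = [3.0, 0.5, 0.0]
for name in AGGREGATORS:
    print(name, relation_vector(d_i, d_j, name))
```

Unlike a single similarity score, the result keeps one feature per term, so a classifier can weight terms individually.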
13
Classifier Design for Imbalanced Social Network Data
Imbalanced social network data
The social network is sparse: the number of relations r is far smaller than the number of possible pairs among the n actors.
A common approach to dealing with class imbalance is to artificially re-balance the training data:
Up-sampling the minority class.
Down-sampling the majority class.
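Down-sampling the majority class can be sketched as follows. This is a generic random down-sampling routine, not necessarily the procedure the authors applied; the function name `down_sample`, the fixed seed, and the toy examples are assumptions.

```python
import random

def down_sample(examples, seed=0):
    """Randomly shrink the larger class to the size of the smaller one.

    `examples` is a list of (features, label) pairs with label 1 or 0.
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    kept = rng.sample(majority, len(minority))
    return minority + kept

# Hypothetical data: 3 connected pairs versus 47 unconnected pairs.
examples = [([i], 1) for i in range(3)] + [([i], 0) for i in range(3, 50)]
balanced = down_sample(examples)
print(len(balanced), sum(label for _, label in balanced))
```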
14
Classifier Design for Imbalanced Social Network Data cont.
An SVM classifier with a linear kernel is used for learning the social network.
Learning the social network is a binary classification problem with two classes: positive (connected) and negative (broken).
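To make the binary formulation concrete, here is a small linear SVM trained by batch sub-gradient descent on the regularized hinge loss. It is a stand-in for an off-the-shelf SVM solver (the slides do not say which implementation was used), and the toy relation vectors, hyperparameters, and function names are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimise lam/2 * ||w||^2 + mean hinge loss; labels must be +1 or -1."""
    dim = len(X[0])
    n = len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:  # the hinge loss is active for this example
                for k in range(dim):
                    grad_w[k] -= yi * xi[k] / n
                grad_b -= yi / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 (connected) or -1 (broken) for one relation vector."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1

# Hypothetical, centred relation vectors: +1 = connected, -1 = broken.
X = [[1.0, 1.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

In practice a library solver would replace this loop; the point is only that each relation vector becomes one labeled example for a linear decision function.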
15
Experiments
Evaluation measures: Precision, Recall and F-measure.
Two-fold cross validation.
Dataset: a real FOAF database containing 210,611 RDF triples.
Relations between the individuals: a set of true social networks.
Any web resource addresses and URLs related to the individuals.
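The evaluation measures listed on this slide can be computed as in the sketch below. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall) and treats label 1 as the positive (connected) class; the function name and the toy labels are hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical true ties versus predicted ties for five actor pairs.
truth = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(truth, predicted)
print(precision, recall, f1)
```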
16
Dataset cont.
All social networks:
Actors: 34,275
Real ties: 33,419
Possible relationships: 587,370,675
Ratio: 1:17575
Down-sampling: social networks with fewer than 20 or more than 70 members are removed.
After breaking the database into small sub-graphs:
Actors: 254
Real ties: 246
Possible relationships: 32,131
Ratio: 1:130
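The figures on this slide are consistent with counting possible relationships as unordered actor pairs, n(n-1)/2, and rounding the ratio down. The short check below reproduces them; the helper name `possible_pairs` is an assumption.

```python
def possible_pairs(n_actors):
    """Number of unordered actor pairs: n * (n - 1) / 2."""
    return n_actors * (n_actors - 1) // 2

# (actors, real ties) before and after down-sampling, as reported on the slide.
for n_actors, ties in [(34275, 33419), (254, 246)]:
    pairs = possible_pairs(n_actors)
    print(f"actors={n_actors} possible={pairs} ratio=1:{pairs // ties}")
```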
17
Results
18
Results cont.
19
20
Conclusion
A text classification formulation to approximately predict social relations using web documents was proposed.
A document vector aggregation model is proposed instead of document similarity.
Down-sampling is used to deal with highly imbalanced data.