
1

Learning Social Networks from Web Documents Using Support Vector Classifiers

IEEE, WI’06

Masoud Makrehchi & Mohamed S. Kamel

Presenter: Teng-Kai Fan

Date: 2008-11-18

2

Abstract

Learning a social network from incomplete relationship data.

Translating social network extraction into a text classification problem.

Classifier: SVM (Support Vector Machine).

Evaluation: FOAF (Friend of a Friend) dataset and F-measure.

3

Outline

Introduction

Related Work

Problem Statement

Proposed approach: Learning Social Network from Incomplete Network

Experiment

Conclusion

4

Introduction

A social network is defined as a map of relationships (ties) between individuals (actors).

Applications: marketing, advertising, finding friends.

5

Introduction cont.

In this study, the authors propose an approach to generate a social network from a collection of web documents.

Actor-term matrix: every person can be represented by her corresponding documents.

Social relations are learned from the actor-term database.

Assumption: the social network is partially explored (training dataset).

A support vector classifier is employed to extract the missing relations and complete the social network.

6

Related Work

Social network models can be constructed either directly or indirectly.

Direct (descriptive): acquaintanceship is extracted from information such as e-mail, cited papers, relational databases, and web page links.

Indirect (predictive): acquaintanceship is translated into the similarity of two actors, using sources such as papers, opinions, and news.

7

Problem Statement

The goal is to predict and learn the network while knowing only a small number of relations between individuals.

Social networks are represented either by graphs or by adjacency matrices.

Incomplete matrix (training examples)

Complete matrix (learned matrices)
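The incomplete matrix can be read as a set of labeled actor pairs (training examples) plus a set of unlabeled pairs still to be predicted. A minimal sketch of that split, assuming undirected relations and a hypothetical `known` dictionary that maps a pair of actor indices to 1 (connected) or 0 (not connected); neither name comes from the paper:

```python
from itertools import combinations


def split_pairs(n_actors, known):
    """Split all unordered actor pairs into labeled and unlabeled ones."""
    labeled, unlabeled = {}, []
    for pair in combinations(range(n_actors), 2):
        if pair in known:
            labeled[pair] = known[pair]   # entry already present in the matrix
        else:
            unlabeled.append(pair)        # entry the classifier must fill in
    return labeled, unlabeled


# Hypothetical 4-actor network with three known entries.
known = {(0, 1): 1, (0, 2): 0, (1, 3): 1}
labeled, unlabeled = split_pairs(4, known)
print(labeled)
print(unlabeled)
```

The labeled pairs play the role of the training set; the unlabeled pairs are the entries that turn the incomplete matrix into the complete one.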

8

Learning Social Network from Incomplete Network

Two assumptions:

A subset of the relations is known, represented by an adjacency matrix.

Textual data associated with the actors is available.

Three steps:

Modeling the actors in the social network.

Modeling the relations between the actors.

Training a classifier to learn the social network.

9

Actor Modeling

Each actor is represented by her web documents, including home page, blog, CV, and so on.

All documents associated with an individual are merged to build a single document vector. Each document is associated with one actor.

where the weighting schema is

10

Actor Modeling cont.

Consequently, the corpus is modeled by a matrix called the Actor-Term Matrix.

Dimension reduction:

Stemming and a stop-word list.

DF (document frequency): terms with DF less than 5 or more than 100 were removed.

Figure: Actor-Term Matrix (axes: Actor, Term; entries: tf*idf).
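The actor modeling step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the exact weighting formula is not given on the slide, so tf*idf is taken here as raw term count times log(N / DF); stemming is omitted; and the stop-word list, the function name `actor_term_matrix`, and the toy data are all hypothetical. The DF thresholds default to the values named on the slide (5 and 100).

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "on", "is"}  # tiny illustrative list


def actor_term_matrix(actor_docs, min_df=5, max_df=100):
    """actor_docs: {actor: [text, ...]} -> (terms, {actor: [tf*idf, ...]})."""
    # Merge all documents of an actor into one bag of words.
    counts = {
        actor: Counter(
            w for doc in docs for w in doc.lower().split() if w not in STOP_WORDS
        )
        for actor, docs in actor_docs.items()
    }
    # DF: number of actors whose merged document contains the term.
    df = Counter(term for c in counts.values() for term in c)
    # Keep only terms whose DF lies inside [min_df, max_df].
    terms = sorted(t for t, f in df.items() if min_df <= f <= max_df)
    n = len(counts)
    matrix = {
        actor: [c[t] * math.log(n / df[t]) for t in terms]
        for actor, c in counts.items()
    }
    return terms, matrix


# Toy corpus; thresholds lowered so that some terms survive.
actor_docs = {
    "alice": ["semantic web research", "web mining"],
    "bob": ["web design"],
    "carol": ["semantic search"],
}
terms, matrix = actor_term_matrix(actor_docs, min_df=2, max_df=2)
print(terms)
print(matrix["alice"])
```

Each row of `matrix` is one actor's document vector over the retained terms.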

11

Relationship Modeling

One simple approach to modeling the relation between two actors is to estimate the similarity of their document vectors. A similarity measure (e.g., cosine, Jaccard, or correlation) gives very poor results because it models each relation with only one variable.

A better approach is to aggregate the document vectors of the actors on both sides of the relation into a new aggregated document vector.
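To make the "only one variable" point concrete, here is a small sketch of the cosine measure. Whatever the length of the two document vectors, the relation is reduced to a single number. The vectors are made-up values, and the function is a generic cosine similarity rather than code from the paper:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


d_i = [1.0, 0.0, 2.0]
d_j = [1.0, 3.0, 0.0]
print(cosine(d_i, d_j))  # one scalar describes the whole relation
```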

12

Relationship Modeling cont.

Let d_i and d_j be the document vectors associated with actors a_i and a_j.

The relation between two actors is modeled by aggregating their vectors with an operator such as MIN, MAX, or Product.

The aggregated document vector (relation vector) is obtained as follows:
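A minimal sketch of the aggregation, assuming the operator is applied term by term so that the relation vector has the same length as d_i and d_j. The slide's formula is not reproduced here, so the element-wise reading, the function name `relation_vector`, and the sample vectors are assumptions:

```python
import operator

# The three operators named on the slide.
AGGREGATORS = {"min": min, "max": max, "product": operator.mul}


def relation_vector(d_i, d_j, op="min"):
    """Aggregate two document vectors term by term into one relation vector."""
    agg = AGGREGATORS[op]
    return [agg(x, y) for x, y in zip(d_i, d_j)]


d_i = [2.0, 0.0, 3.0]
d_j = [1.0, 4.0, 3.0]
for name in AGGREGATORS:
    print(name, relation_vector(d_i, d_j, name))
```

Unlike a similarity score, the result keeps one feature per term, which is what lets a text classifier be trained on relations.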

13

Classifier Design for Imbalanced Social Network Data

Imbalanced social network data: the social network is sparse, where n is the number of actors and r is the number of relations.

A common approach to dealing with class imbalance is to artificially re-balance the training data:

Up-sampling the minority class.

Down-sampling the majority class.
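Down-sampling the majority class can be sketched as random under-sampling of the negative examples. This is a generic illustration, not necessarily the procedure used in the paper; it assumes label 1 marks the minority (connected) class, and the function name, the `ratio` parameter, and the `seed` parameter are hypothetical:

```python
import random


def down_sample(examples, labels, ratio=1.0, seed=0):
    """Keep all positives and at most ratio * (#positives) random negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    keep = sorted(pos + keep_neg)  # preserve the original ordering
    return [examples[i] for i in keep], [labels[i] for i in keep]


# Toy data: 1 positive relation among 10 candidate pairs.
X = list(range(10))
y = [1] + [0] * 9
X_bal, y_bal = down_sample(X, y)
print(X_bal, y_bal)
```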

14

Classifier Design for Imbalanced Social Network Data cont.

An SVM classifier with a linear kernel is used for learning the social network.

Learning the social network is a binary classification problem with two classes: positive (connected) and negative (broken).

15

Experiments

Evaluation measures: Precision, Recall, and F-measure.

Two-fold cross-validation.

Dataset: a real FOAF database containing 210,611 RDF triples.

Relations between the individuals: a set of true social networks.

Any web resource addresses and URLs related to the individuals.
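The evaluation measures named above can be computed from predicted and true labels as in the sketch below. It assumes the positive class is labeled 1 and that the F-measure is the balanced F1 (harmonic mean of precision and recall); the slide does not state which F-measure variant was used, and the sample labels are made up:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy labels: 3 true relations; 2 are found, 1 is missed, 1 is a false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(p, r, f)
```

With heavily imbalanced classes these measures are more informative than accuracy, since predicting "not connected" everywhere would already be right for almost all pairs.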

16

Dataset cont.

All social networks:

Actors: 34,275

Real ties: 33,419

Possible relationships: 587,370,675

Ratio: 1:17575

Down-sampling: social networks with fewer than 20 or more than 70 members are removed.

After breaking the database into small sub-graphs:

Actors: 254

Real ties: 246

Possible relationships: 32,131

Ratio: 1:130
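The reported counts can be checked with a few lines of arithmetic. The sketch assumes undirected ties, so the number of possible relationships among n actors is n(n-1)/2, which matches both totals above. Reading the ratio as real ties to non-ties is an assumption; it gives values close to the reported 1:17575 and 1:130.

```python
def imbalance(n_actors, n_ties):
    """Return (possible undirected pairs, non-ties per real tie)."""
    possible = n_actors * (n_actors - 1) // 2
    negatives = possible - n_ties
    return possible, negatives / n_ties


print(imbalance(34275, 33419))  # full FOAF data
print(imbalance(254, 246))      # after keeping only the small sub-graphs
```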

17

Results

18

Results cont.

20

Conclusion

A text classification formulation for approximately predicting social relations from web documents was proposed.

A document vector aggregation model is proposed instead of document similarity.

Down-sampling is used to deal with highly imbalanced data.
