user behavior model & recommendation on basis of social networks


American International University - Bangladesh

Faculty of Science and Information Technology

Department of Computer Science

User Behavior Modeling & Recommendation

System Based On Social Networks

A thesis submitted for the degree of

Bachelor of Science in Computer Science and Engineering

By:

Alam Shah

10-17685-3

Hossain, MD. Shakawat

11-18494-1

Taher, Najeeb Ahmad

11-18198-1

Supervisor:

Md. Saddam Hossain

Assistant Professor, Department of Computer Science, American

International University-Bangladesh

Summer 2014

Declaration

This is to certify that this project is our original work. No part of this has been

submitted elsewhere partially or fully for the award of any other degree. Any

material reproduced in this project has been properly acknowledged.

Alam Shah
ID: 10-17685-3
Department: CSE

Hossain, MD. Shakawat
ID: 11-18494-1
Department: CSE

Taher, Najeeb Ahmad
ID: 11-18198-1
Department: CSE


Approval

The thesis titled “User Behavior Modeling & Recommendation System Based On Social Networks” has been submitted to the following respected members of the Board of Examiners of the Faculty of Science and Information Technology in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering and has been accepted as satisfactory.

Md. Saddam Hossain

Assistant Professor

Faculty of Computer Science

American International University-Bangladesh

Dr. Dip Nandi

Assistant Professor & Head

Faculty of Computer Science

American International University-Bangladesh


Professor Dr. Tafazzal Hossain

Dean

Faculty of Computer Science

American International University-Bangladesh

Dr. Carmen Z. Lamagna

Vice Chancellor

American International University-Bangladesh


Acknowledgements

Special thanks to our honorable teacher and supervisor Md. Saddam Hossain, Assistant Professor, Department of Computer Science, American International University-Bangladesh. We are very grateful to him for giving us the opportunity to work with him. Without his continuous support, it would have been very difficult for us to complete this work. We would also like to thank all the faculty members for their guidance in preparing proper documentation for our project.

Abstract

At present, social networks play an important role in expressing people’s sentiments and their interests in particular fields. Extracting a user’s public social network data (what the user shares with friends and relatives, and how the user reacts to others’ thoughts) amounts to extracting the user’s behavior. If, under some defined hypotheses, we can make a machine understand human sentiment and interest, it becomes possible to make recommendations that match a user’s personal interests on the basis of the sentiment analyzed by the machine. Our main approach is to make suggestions to a user according to the specific interests anticipated by analyzing the user’s public data. This can be extended to business analysis, suggesting products or services of different companies according to a consumer’s personal choices. Such automation will also help to choose the right candidates for a questionnaire. The system will also help people learn about themselves and about how their behavior may influence others. It is possible to identify different types of people, such as dependable people, people with leadership skills, people with a supportive mentality, and people with a negative mentality.

Table of Contents

List of Figures
1. Introduction
2. Previous Works
   2.1 Location Based Social Network
   2.2 Collaborative Recommendation Based Social Network
   2.3 Sentiment Intensity Analysis of Informal Texts
   2.4 Big Five [1] Modeling
3. Research Questions
4. Proposed Research Methodology
   4.1 Data Collection
   4.2 Data Analysis
   4.3 Results
   4.4 Recommendation Analysis
5. Conclusion

List of Figures

4.1 Modeling User Behavior
4.2 Pie Chart of LIWC Results
4.3 Personality Based Recommendation System

List of Tables

2.1 Comparison of different location based social networks
4.1 Relationship between LIWC categories and Big Five factors
4.2 Products under Big Five factors
4.3 Products under Big Five factors
4.4 Products under Big Five factors
4.5 Products under Big Five factors
4.6 Products under Big Five factors
4.7 Products under Big Five factors

Chapter 1

Introduction

With millions of users, social networking services such as Facebook [2] and Twitter [3] have become some of the most popular Internet applications. These applications are sources of knowledge and information. The rich knowledge accumulated on these social networking sites enables a variety of recommendation systems for new users and media [4]. To use this opportunity, it is possible to create an automated system that categorizes social network users according to the Big Five [1] personality factors. To categorize users in this way, their data must be collected without interfering with their daily activities. The system will thus help people learn about other people. For example, suppose an employee needs a vacation and his boss is listed as a friend on an Online Social Network (OSN). The employee can then make the request with the boss’s behavior, as determined by the system, in mind (Neuroticism [1] indicates a higher chance of refusal, while Agreeableness [1] indicates a higher chance of approval). OSNs deal with big data; after analyzing such data, the system will be able to predict who is suitable for leadership and who may oppose it. Many challenges to recommendation systems have been tackled by new approaches that use different data sources and methodologies to generate different kinds of recommendations. In this article we provide a description of such systems.

From the very beginning, consumer interests have had a great influence on business policy. Offering the right products or services to the right customers is the main objective of every successful business policy. Many business organizations can benefit from the data collected from OSNs. At present, the popularity of social networks is increasing very rapidly. From a sociologist’s point of view, OSNs can be characterized as “collective goods produced through computer mediated collective action” [5]. Users spend a large part of their daily lives on OSNs and share a great deal of information about themselves, their friends and their families. This is therefore a great opportunity to learn about people’s sentiments and interests. Understanding the behavior of OSN users is possible, and it has become a crucial factor in advertising policy and better product design.

In particular, given the success of item recommendation systems on commercial websites such as Amazon [6] and Netflix [7], it is considered worthwhile to revisit the recommendation problem from the perspective of social networking. In general, recommendation systems aim to provide personalized recommendations of items to users based on their previous behavior as well as on other information gathered from item descriptions and user profiles.

Our experiment is based on Twitter [3] and Facebook [2], the most popular OSN websites, both of which carry a large volume of advertising. These websites have a very large number of users, who feel comfortable using them because of user-friendly features such as micro-blogging, status updates, photo and video sharing, commenting on posts, joining and creating groups, liking and subscribing to pages and profiles, creating events, playing games and so on.

We aim to analyze user behavior in the following steps: collecting the user’s past activities on OSNs, mapping them onto the Big Five factors [1], identifying a set of the user’s particular fields of interest, and making recommendations to him or her through informative services.


Chapter 2

Previous work

Online social networking is the practice of expanding a person’s business and social contacts by making connections through individuals [8]. In this era of the Internet, OSNs are extremely popular. According to Nielsen Online’s report, two thirds of the world’s Internet population use OSNs, which account for about 10% of all time spent on the Internet [9]. Because OSNs give users the opportunity to say what they want to their friends, relatives and others connected through their accounts, there are ample chances to identify and characterize a person’s behavior type implicitly, without interfering in his or her personal life [4].

2.1 Location Based Social Network [10]

A social network is a social structure made up of individuals connected by one or

more specific types of interdependency, such as friendship, common interests, and

shared knowledge. Generally, a social networking service builds on and reflects

the real-life social networks among people through online platforms such as a

website, providing ways for users to share ideas, activities, events, and interests

over the Internet. The increasing availability of location-acquisition technology

(for example GPS and Wi-Fi) empowers people to add a location dimension to

existing online social networks in a variety of ways. For example, users can upload

location-tagged photos to a social networking service such as Flickr [11], comment


on an event at the exact place where the event is happening (for instance, in Twitter [3]), share their present location on a website (such as Foursquare [12]) for organizing a group activity in the real world, and record travel routes with GPS trajectories to share travel experiences in an online community. Here, a location can be represented in absolute (latitude-longitude coordinates), relative (100 meters north of the Space Needle), and symbolic (home, office, or shopping mall) form. Also, the location embedded into a social network can be a stand-alone instant location of an individual, like in a bar at 9pm, or a location history accumulated over a certain period, such as a GPS trajectory: a cinema → a restaurant → a park → a bar.

The dimension of location brings social networks back to reality, bridging the gap between the physical world and online social networking services. For example, a user with a mobile phone can leave comments about a restaurant on an online social site (after finishing dinner) so that people from his or her social structure can refer to those comments when they later visit the restaurant. In this example, users create their own location-related stories in the physical world and browse other people’s information as well. An online social site becomes a platform for facilitating the sharing of people’s experiences. Furthermore, people in an existing social network can expand their social structure with the new interdependency derived from their locations. As location is one of the most important components of user context, extensive knowledge about an individual’s interests and behavior can be learned from his or her locations. For instance, people who enjoy the same restaurant can connect with each other. Individuals who regularly hike the same mountain can be put in contact with each other to share their travel experiences. Sometimes, two individuals who do not share the same absolute location can still be linked as long as their locations are indicative of a similar interest, such as beaches or lakes.

These kinds of location-embedded and location-driven social structures are known

as location-based social networks, formally defined as follows:

“A location-based social network (LBSN) [10] does not only mean adding a loca-

tion to an existing social network so that people in the social structure can share

location embedded information, but also consists of the new social structure made

up of individuals connected by the interdependency derived from their locations in


the physical world as well as their location-tagged media content, such as photos,

video, and texts. Here, the physical location consists of the instant location of an

individual at a given timestamp and the location history that an individual has

accumulated in a certain period. Further, the interdependency includes not only

that two persons co-occur in the same physical location or share similar location

histories but also the knowledge, e.g., common interests, behavior, and activities,

inferred from an individual’s location (history) and location-tagged data.”

In a location-based social network, people can not only track and share the location-related information of an individual via either mobile devices or desktop computers, but also leverage collaborative social knowledge learned from user-generated and location-related content, such as GPS trajectories and geo-tagged photos. One example is determining this summer’s most popular restaurant by mining people’s geo-tagged comments. Another example could be identifying the most popular travel routes in a city based on a large number of users’ geo-tagged photos. Consequently, LBSNs enable many novel applications that change the way we live, such as physical location (or activity) recommendation systems [13] [14] and travel planning, while offering many new research opportunities for social network analysis (like user modeling in the physical world and connection strength analysis) [15] [16], spatio-temporal data mining [17], ubiquitous computing [18], and spatio-temporal databases [17] [19].

Existing applications providing location-based social networking services can be broadly divided into three categories: geo-tagged-media-based, point-location-driven and trajectory-centric.

• Geo-tagged-media-based. [10] Quite a few geo-tagging services enable users

to add a location label to media content such as text, photos, and videos

generated in the physical world. The tagging can occur instantly when

the medium is generated, or after a user has returned home. In this way,

people can browse their content at the exact location where it was created

(on a digital map or in the physical world using a mobile phone). Users can

also comment on the media and expand their social structures using the

interdependency derived from the geo-tagged content (for example, in favor

of the same photo taken at a location). Representative websites of such

location-based social networking services include Flickr, Panoramio, and


Geo-twitter. Though a location dimension has been added to these social

networks, the focus of such services is still on the media content. That is,

location is used only as a feature to organize and enrich media content while

the major interdependency between users is based on the media itself.

• Point-location-driven. [10] Applications like Foursquare and Google Latitude encourage people to share their current locations, such as a restaurant or a museum. In Foursquare, points and badges are awarded for checking in at venues. The individual with the highest number of check-ins at a venue is crowned Mayor. With the real-time location of users, an individual can discover friends (from her social network) around her physical location so as to enable certain social activities in the physical world, e.g., inviting people to have dinner or go shopping. Meanwhile, users can add tips to venues that other users can read, which serve as suggestions for things to do, see, or eat at the location. With this kind of service, a venue (point location) is the main element determining the interdependency connecting users, while user-generated content such as tips and badges features a point location.

• Trajectory-centric. [10] In a trajectory-centric social networking service, such as Bikely, SportsDo, and Microsoft GeoLife, users pay attention to both point locations (passed by a trajectory) and the detailed route connecting these point locations. These services not only tell users basic information about a particular trajectory, such as distance, duration, and velocity, but also show a user’s experiences, represented by tags, tips, and photos for the trajectory. In short, these services provide “how” and “what” information in addition to “where” and “when”. In this way, other people can refer to a user’s travel or sports experience by browsing or replaying the trajectory on a digital map, and follow the trajectory in the real world with a GPS phone.


Table 2.1 provides a brief comparison of these three services. The major differences between the point-location-driven and the trajectory-centric LBSN lie in two aspects. One is that a trajectory offers richer information than a point location, such as how to reach a location, how long a user stayed in a location, the travel time between two locations, and the physical or traffic conditions of a route. As a result, we are more likely to understand an individual’s behavior and interests accurately in a trajectory-centric LBSN. The other is that in a point-location-driven LBSN users usually share their real-time location, while a trajectory-centric LBSN more likely delivers historical locations, as users typically prefer to upload a trajectory after a trip has finished (though it can also be uploaded continuously). This property could compromise some scenarios based on the real-time location of a user. However, it reduces to some extent the privacy issues in a location-based social network. In other words, when people see a user’s trajectory, the user is no longer there.

Table 2.1. Comparison of different location based social networks

LBSN Services            Focus           Real-time        Information
Geo-tagged-media-based   Media           Normal           Poor
Point-location-driven    Point location  Instant          Normal
Trajectory-centric       Trajectory      Relatively Slow  Rich

Actually, the location data generated in the first two LBSN services can be converted into the form of a trajectory, which might be used by the third category of LBSN service. For example, if we sequentially connect the point locations of the geo-tagged photos taken by a user over several days, a sparse trajectory can be formed. Likewise, the check-in records of an individual ordered by time can be regarded as a low-sampling-rate trajectory. However, because of this sparseness (the distance and time interval between two consecutive points in a trajectory can be very large), the uncertainty in a single trajectory from the first two services is increased. To put these trajectories into trajectory-centric LBSN services, we need to use them in a collective and collaborative way.
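The conversion of check-in records into a low-sampling-rate trajectory can be illustrated with a minimal Python sketch. The `CheckIn` record, the use of epoch seconds and the helper names are assumptions made for illustration only; they are not taken from any of the LBSN services discussed. The sketch orders check-ins by time and reports the time and distance gap between consecutive points, which is the sparseness referred to above.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class CheckIn:
    """Hypothetical check-in record; field names are illustrative."""
    timestamp: float  # seconds since epoch (assumed unit)
    lat: float        # latitude in degrees
    lon: float        # longitude in degrees


def haversine_km(a, b):
    """Great-circle distance between two check-ins in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def to_trajectory(checkins):
    """Order check-ins by time so they form a (sparse) trajectory."""
    return sorted(checkins, key=lambda c: c.timestamp)


def gaps(trajectory):
    """Return (time gap in seconds, distance in km) for each consecutive pair."""
    return [(b.timestamp - a.timestamp, haversine_km(a, b))
            for a, b in zip(trajectory, trajectory[1:])]


# Two check-ins recorded out of order, two hours and one degree of longitude apart.
records = [CheckIn(7200.0, 0.0, 1.0), CheckIn(0.0, 0.0, 0.0)]
trajectory = to_trajectory(records)
sparseness = gaps(trajectory)
```

Large values in `sparseness` indicate how much of the route between two points is unknown, which is why such trajectories are better used collectively than one at a time.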

Trajectory data is the most complex data structure to be found in the three


LBSN services, and provides the richest information. If it is handled well, other

data sources become easier to deal with. Moreover, as mentioned above, loca-

tion data can be converted into a trajectory on many occasions. Consequently,

some methodologies designed for trajectory data can be employed by the first two

LBSN services.

2.2 Collaborative Recommendation Based Social Network [20]

With the recent advances in technology, there is an emerging presence of social

media and social networking systems. In the case of multimedia enriched social

network systems, such as last.fm, the collective goods are musical tracks and the

collective action is the process of crafting individual profiles of musical preference

and linking them either explicitly, via bonds of friendship, or implicitly, through

collaborative annotation.

This collective action leads to the creation of an implicit social networking structure, which we aim to explore further. In particular, given the success of item recommendation systems on commercial websites such as Amazon.com and Netflix, it is considered worthwhile to revisit the recommendation problem from the novel perspective of social networking. In general, recommendation systems aim to provide personalized recommendations of items to users based on their previous behavior as well as on other information gathered from item descriptions and user profiles.

However, no emphasis has yet been placed on personalization based explicitly on social networks. The reason is that, despite increasing interest in the exploration of social networks, there is no concrete dataset that includes both explicit bonds of friendship among users and free-form collaborative annotation of items. This is because most social media systems do not allow free access to all user profiles or lists of friends.

Given the incentives of the widespread adoption of social networks and the lack of previous studies that directly address the problem of efficiently integrating the added-value knowledge provided by those networks into collaborative recommendation, Konstas et al. [20] propose a new methodology that tackles the aforementioned issues. Within this context they make the following contributions:

• Konstas et al. [20] introduce a dataset based on data from the last.fm social network that describes a social graph among users, tracks and tags, effectively including bonds of friendship and collaborative annotation.

• Konstas et al. [20] evaluate a Random Walk with Restarts (RWR) model on this dataset and show that incorporating friendship and social tagging can improve the performance of an item recommendation system.

• Konstas et al. [20] show that the RWR method outperforms the standard Collaborative Filtering (CF) method, which they also evaluate against the same dataset.

• Konstas et al. [20] show that their RWR-based method requires no training and successfully manages to capture
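To give an intuition for a Random Walk with Restarts on a graph of users, tracks and tags, the following is a minimal Python sketch. It is not the implementation evaluated in [20]: the toy graph, the node naming scheme, the restart probability of 0.15 and the fixed number of iterations are all assumptions chosen for illustration. At every step the walker either follows a random edge or, with probability `restart`, jumps back to the start node; the resulting visiting probabilities rank nodes by their closeness to the start user.

```python
def random_walk_with_restart(adj, start, restart=0.15, iters=100):
    """Approximate RWR visiting probabilities by repeated propagation.

    adj maps each node to the list of its neighbours (hypothetical format).
    restart and iters are illustrative defaults, not tuned values.
    """
    nodes = list(adj)
    p = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iters):
        # Restart mass always returns to the start node.
        nxt = {n: (restart if n == start else 0.0) for n in nodes}
        for n in nodes:
            if not adj[n]:
                # Node without neighbours: send its mass back to the start.
                nxt[start] += (1 - restart) * p[n]
                continue
            share = (1 - restart) * p[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        p = nxt
    return p


# Toy graph: friendship (alice-bob), listening (user-track), tagging (track-tag).
graph = {
    "user:alice": ["user:bob", "track:t1"],
    "user:bob": ["user:alice", "track:t2"],
    "track:t1": ["user:alice", "tag:rock"],
    "track:t2": ["user:bob", "tag:rock"],
    "track:t3": ["tag:jazz"],
    "tag:rock": ["track:t1", "track:t2"],
    "tag:jazz": ["track:t3"],
}
scores = random_walk_with_restart(graph, "user:alice")

# Rank tracks that alice has not listened to yet.
candidates = sorted(
    (n for n in graph if n.startswith("track:") and n not in graph["user:alice"]),
    key=lambda n: scores[n],
    reverse=True,
)
```

In this toy graph `track:t2` is reachable from alice through both her friend and a shared tag, so it ranks above `track:t3`, which is connected to neither. No model is trained; the ranking comes directly from the graph structure.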

Konstas et al. [20] distinguish two broad categories of recommendation systems, namely content-based and collaborative filtering. A content-based system selects items based on the correlation between the content of the items (e.g. keywords describing the items, such as album genre and artist for music tracks) and the users’ preferences [5]. However, it is limited to dictionary-bound relations between the keywords used by users and the descriptions of items, and therefore does not explore implicit associations between users.

Collaborative filtering systems are divided into two categories: memory-based and model-based. Memory-based systems [21] calculate the similarity between all users, based on their ratings of items, using a heuristic measure such as cosine similarity or the Pearson correlation score. A missing rating is then predicted by aggregating the ratings of the k nearest neighbors of the user who is to receive the recommendation. The problem with memory-based systems is that parameters such as the number of neighbors have to be chosen on a rather arbitrary basis. What is more, in the case of social networks there is no straightforward way to introduce similarities between users based on friendships and social tagging, other than some ad hoc interpolation of similarity weights from those different sources.
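The memory-based procedure can be sketched in a few lines of Python. This is only one possible variant, under stated assumptions: cosine similarity is computed over the items two users have both rated, the prediction is a similarity-weighted mean, and the ratings dictionary and the value of `k` are invented for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity over the items that both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(ratings, user, item, k=2):
    """Similarity-weighted mean rating of the k most similar users who rated item."""
    neighbours = sorted(
        ((cosine(ratings[user], r), r[item])
         for other, r in ratings.items()
         if other != user and item in r),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no usable neighbour rated this item
    return sum(sim * rating for sim, rating in neighbours) / total


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"a": 5, "b": 3},
    "ben": {"a": 5, "b": 3, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
}
estimate = predict(ratings, "ann", "c", k=2)
```

Because ann's ratings match ben's more closely than cat's, the estimate for item `c` lies nearer to ben's rating of 4 than to cat's rating of 2. The choice of `k` changes the result, which is the arbitrariness noted above.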

Model-based filtering systems assume that users form clusters based on similar behavior in rating items. A model is learned from patterns recognized in users’ rating behavior, using clustering, Bayesian networks and other machine learning techniques [22] [23]. The problem with model-based methods is that several parameters of the model must be fine-tuned, and the resulting models might not generalize well to radically different contexts. What is more, as with memory-based systems, extra effort and training are needed in order to introduce knowledge from social networks.

Many recent research publications revolve around social media. In particular, several studies focus on dataset collection and analysis from social networks. Das et al. [24] proposed sample-based algorithms that capture information in the neighborhood of a user in dynamic social networks utilizing random walks. Halpin et al. [25] studied the distribution of tags in the social bookmarking site del.icio.us and proposed a generative model of collaborative tagging in order to evaluate the dynamics that lie beneath the act of collaborative recommendation. Their findings show that the collected dataset follows a power-law distribution. Even though both studies examine social networks that are based on social tagging, they do not explore the dynamics of friendships among users. Taking into account the power of free-form tagging of items by users other than their authors or owners, researchers also focus on tag recommendation. Subramanya and Liu [26] propose a system that automatically recommends tags for blogs, using similarity ranking in a manner similar to collaborative filtering techniques. Strohmaier [27] studies a novel idea in tag recommendation, which bridges the gap between the keywords issued by a user in a query and the tags actually used by a social system. He argues that the tags used by a user when performing a query exhibit his or her intent, whereas the annotations of items describe content semantics. As a result, he proposes a new form of purpose tags, which extract the intent of the user and facilitate goal-oriented search in a social network. Both studies underline the importance and discriminative power of social tagging, which is also validated by our work.

Several studies apply Random Walks on bipartite graphs. Craswell and Szummer [28] study a clickthrough data graph in order to perform item recommendation. Nevertheless, no social content is available between users. Yildirim and Krishnamoorthy [23] propose a novel recommendation algorithm which performs Random Walks on a graph that denotes similarity measures between items. They evaluate their system using data from MovieLens. Although the Random Walk model performs well in the context of recommendation, their use of an item-item similarity matrix raises some issues as to the ability of the system to extend when other similarities are introduced based on social tagging. Recent work has also applied Random Walks over a social graph instead of bipartite graphs, similar to what we propose in this paper. Clements et al. [29] propose a single-term query system performing Random Walks on graphs including users, items and tags. They use data from LibraryThing, an online book catalogue where users rate and tag books they have read. Due to the lack of ground truth, they assume that the tags assigned to an item by each user are the same as the terms that user would use as a query to retrieve the annotated item. We argue that this assumption is rather strong and that a user experiment would be more appropriate in order to properly establish the ground truth.

Hotho et al. evaluate a variation of adapted PageRank on a dataset from del.icio.us, exploring folksonomies of bookmarks based also on collaborative annotation [30]. However, since they evaluate their proposed algorithm empirically, any attempt to compare against their results becomes cumbersome. Although both studies are close to our approach, we use a different model, namely RWR, in which we explicitly include friendships in our dataset and perform collaborative recommendations instead of queries on the graph.


2.3 Sentiment Intensity Analysis of Informal Texts [31]

The proliferation of social networks such as blogs, forums and other online means of expression and communication has resulted in a landscape where people are able to discuss freely online through a variety of means and applications.

Probably one of the most novel and interesting ways of communicating in cyberspace is through 3D virtual environments. In such environments, people, represented by their avatars, socialize and interact with each other and with virtual humans operated by machines, i.e., computer systems.

Despite the fact that the graphics of those environments remain relatively poor, futuristic movies such as Avatar [32] provide an example of the sophisticated landscapes and renderings that will be attainable by such environments in the foreseeable future. However, regardless of how attractive and realistic such artificial 3D worlds become, they will always remain heavily dependent on the quality of human communication that takes place within them. As shown in [33] [34] [35], communication in environments that are not limited to a single, textual modality consists not just of semantic data transfer, but also of dense non-verbal communication in which sentiment plays an important role. Moreover, without emotion no consistent and coherent (virtual) body language is possible. Such primordial movements include facial expressions, eye gaze, arm-language coordination, etc.

Sentiment detection from textual utterances can play an important role in the development of realistic and interactive dialog systems. Such systems serve various educational, business or entertainment-oriented functions and also include systems that are deployed in 3D virtual environments. With the aid of “dialog coherence” modules, conversational systems aim at a realistic interaction flow at the emotional level, e.g., Affect Listeners [36], and can greatly benefit from the correct identification of the emotional state of their participants. Taking into consideration that the majority of input to practical conversational systems consists of short, informal, textual exchanges, it is essential that the sentiment analysis component integrated in the dialog system is able to cope with this informal, often incomplete or ill-formed type of communication.

Sentiment analysis, the process of automatically detecting if a text segment con-

tains emotional or opinionated content and extracting its polarity or valence, is

a field of research that has received significant attention in recent years, both in

academia and in industry. The aforementioned increase of user-generated con-

tent on the web has resulted in a wealth of information that is potentially of vital

importance to institutions and companies, providing them with data to research

their consumers, manage their reputations and identify new opportunities. As

a result, most of the research in the field has been limited to product reviews,

where the aim is to predict whether the reviewer recommends a product or not,

based on the textual content of the review.

The focus of Paltoglou et al. [31] is different. Instead of focusing on product

reviews, they explore the more ubiquitous field of informal social interactions in

cyberspace. The unprecedented popularity of social platforms such as Facebook,

Twitter, MySpace as well as 3D virtual worlds has resulted in an unparalleled

increase in textual exchanges that remain relatively unexplored, especially in terms

of their emotional content.

Specifically, Paltoglou et al. [31] aim to answer the following question: can lexicon-

based approaches perform more effectively than machine-learning approaches in

this domain? This question is particularly important, because previous research

in sentiment analysis using product reviews has shown that machine-learning approaches

typically outperform lexicon-based ones, but whether

the same holds for informal, social interactions has not been explored in the past. The

differences between the two domains are numerous. Firstly, reviews tend to be

longer and more verbose than typical social interactions, which may only be a

few words long and often contain significant spelling errors [37]. Secondly, no

clear “gold standard” exists in the domain of informal communications with

which to train a machine-learning classifier, in contrast to the “thumbs up” or

“thumbs down” feature of reviews. Lastly, social exchanges on the web tend to

be much more diverse in terms of their topics, with issues ranging from politics

and recent news to religion, whereas product reviews by definition have

a specific subject, i.e. the product under discussion. The study of emotional

and social interactions in virtual worlds implies the study of virtual human (VH)

behaviors. Two types of VH exist: avatars (i.e. the projection of a real human in

the 3D environment) and agents (i.e. the projection of an autonomous machine

simulating a human in the virtual world). These VH types result in three possible

types of communications: avatar to avatar, agent to agent and avatar to agent.

Each one of those has the following interesting aspects respectively:

- A non-verbal body language based on VH emotional states and mind profile.

- A potential visualization of the interaction from a third VH that should be

represented by an avatar.

- A non-verbal communication for the human representation and an action of

the agent strongly influenced by interpreted emotions from the avatar.

It seems only logical that artificial intelligence and conversation systems would

strongly benefit these aspects in order to make the communication more realistic.

The paper of Paltoglou et al. [31] is structured as follows. It first provides

a brief overview of relevant work in sentiment analysis. Section 3 presents

the lexicon-based classifier and section 4 presents the two machine-learning

classifiers that are used in the study. Section 5 describes the data sets

that were used and explains the experimental setup, while section 6 presents

and analyzes the results.

Finally, Paltoglou et al. [31] conclude and present some potential future directions

of research. Sentiment analysis, also known as opinion mining, has attracted considerable

interest recently. Most research has focused on analyzing the content

of either movie or general product reviews (e.g. [38]). Attempts to expand the

application of sentiment analysis to other domains, such as debates [39], news and

blogs [40] are also prominent. The seminal book of Pang and Lee [41] presents a

thorough analysis of the work in the field. In this section we will focus on the more

prominent work which is relevant to our approach. Pang et al. [46] were amongst

the first to explore the sentiment analysis of reviews, focusing on machine-learning

approaches. These approaches generally function as follows: initially, a

general inductive process learns the characteristics of a class during a training

phase, by observing the properties of a number of pre-classified documents (i.e. a

reference corpus) and applies the acquired knowledge to determine the best category

for new, unseen documents, during testing. Pang et al. [46] experimented

with three different algorithms: Support Vector Machines (SVMs), Naive Bayes

and Maximum Entropy classifiers, using a variety of features, such as unigrams

and bigrams, part-of-speech tags, binary and term frequency feature weights and

others. Their best accuracy on a dataset consisting of movie reviews was

attained using an SVM classifier with binary features, although all three classifiers

gave very comparable performance. Other approaches (e.g. [42] [43]) have focused

on extending the feature set with semantically or linguistically-driven features

in order to improve classification accuracy. Dictionary/lexicon-based sentiment

analysis is typically based on lists of words with some sort of pre-determined

emotional weight. Examples of such dictionaries include the General Inquirer

(GI) dictionary [44] and the “Linguistic Inquiry and Word Count” (LIWC) software

[45], which are also used in the present study. Both lexicons are built with

the aid of experts that classify certain tokens in terms of their affective content

(e.g. positive or negative). The “Affective Norms for English Words” (ANEW)

lexicon [46] contains ratings of terms on a nine-point scale in regard to three

individual dimensions: valence, arousal and dominance. The ratings were pro-

duced manually by psychology class students. Ways to produce such emotional

dictionaries in an automatic or semi-automatic fashion have also been introduced

in research [47]. Emotional dictionaries have mostly been utilized in psychology

or sociology oriented research [48].
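
As a minimal sketch of the dictionary/lexicon-based approach described above, the following Python example sums pre-determined word weights over a short informal message. The lexicon entries, their weights, the simple negation rule and all function names are illustrative assumptions; they are not taken from GI, LIWC or ANEW, nor from the classifier of Paltoglou et al. [31].

```python
import re

# Hypothetical toy lexicon: word -> emotional weight.
# The words and weights are assumed for illustration only.
TOY_LEXICON = {
    "good": 2.0,
    "great": 3.0,
    "happy": 2.0,
    "bad": -2.0,
    "awful": -3.0,
    "sad": -2.0,
}

# Assumed negation cues; a real system would need a richer list.
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the weights of lexicon words; a negation flips the next lexicon word."""
    score = 0.0
    negate = False
    for token in tokenize(text):
        if token in NEGATIONS:
            negate = True
            continue
        weight = lexicon.get(token, 0.0)
        if weight != 0.0:
            score += -weight if negate else weight
            negate = False
    return score


def classify(text):
    """Map the summed score to a polarity label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for message in ["What a great day", "not good at all", "see you tomorrow"]:
        print(message, "->", classify(message))
```

Because no training data is required, a scorer of this kind can be applied directly to short, informal exchanges, which is the property that makes lexicon-based methods attractive in this domain.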

The idea of emotional conversationalists is relatively old. First attempts to create

such a system can be traced back to Parry [49], a chatterbot intended for studying

the nature of paranoia and able to express fears, anxieties or beliefs. More recent

work includes research on the development of synthetic characters and chatterbots

with personalities [50] and studies on emotional responses and their influence on

the creation of believable agents or interactive virtual personalities [51]. In [52]

the authors focused on the role of emotions for gaining rapport in spoken dialog systems

by rendering responses that contain suitable emotion, both lexically and

auditorily. Studies on the role of facial expressions in building rapport in virtual

human-user interactions were conducted in [53]. A chatterbot system that generates

emotional responses by selecting and displaying expressive images of the

character emulated by the chatterbot was presented in [54]. Emotional communication

for virtual worlds has been a challenging research field for almost two decades.

One of the pioneering papers was proposed by Cassel et al. [55].

In the proposed system, conversations between multiple human-like agents were

automatically generated and animated with appropriate and synchronized speech,

intonation, facial expressions, and hand gestures. Other work proposed numerous ways to

design personality and emotion models for virtual humans. More recently, specific

personality and emotional states were predicted from hierarchical fuzzy rules to

facilitate personality and emotion control, and in 2009, Pelachaud et al. [56] developed

a model of behavior expressivity using a set of six parameters that act as

modulation of behavior animation. Finally, this year, [35] introduced a graphical

representation of human emotion extracted from text sentences. The main con-

tributions of that approach included an original pipeline that extracts, processes,

and renders emotion of 3D VH. Additionally, the paper presented methods to

optimize the computational pipeline so that real time virtual reality rendering

can be achieved on common PCs. Lastly, it was demonstrated how the Poisson

distribution can be utilized to transfer database extracted lexical and language

parameters into coherent intensities of valence and arousal (i.e. parameters of

Russell’s circumplex model of emotion).

2.4 Big Five modeling [1]

At present, many researchers believe that there are five core personality traits

and the evidence for this theory has been growing over the past 50 years [1]. From

the point of view of a sociologist, social media can be characterized as collective

goods produced through computer-mediated collective action [57]. People

in each category differ in their attitudes toward particular sites, their taste in products,

and their skills for accomplishing work. The five factors are Extraversion, Agreeableness,

Conscientiousness, Neuroticism and Openness [58]. People of different

categories have different ways of expressing their thoughts, and OSN users attach different

levels of significance to expressing their thoughts or behavior [1] [4]. The users

of OSN can be categorized according to the Big Five factors. The behavior of OSN

users varies from location to location, but users from the same or nearby locations tend

to behave similarly [59]. Behavior also varies

with age.

The personality traits used in the 5 factor model are Extraversion, Agreeableness,

Conscientiousness, Neuroticism and Openness to experience [58]. It is important

to ignore the positive or negative associations that these words have in everyday

language. For example, Agreeableness is obviously advantageous for achieving

and maintaining popularity. Agreeable people are better liked than disagreeable

people. On the other hand, agreeableness is not useful in situations that require

tough or totally objective decisions. Disagreeable people can make excellent scientists,

critics, or soldiers. Remember, none of the five traits is in itself

positive or negative; they are simply characteristics that individuals exhibit to a

greater or lesser extent.

Each of these 5 personality traits describes, relative to other people, the frequency

or intensity of a person’s feelings, thoughts, or behaviors. Everyone possesses all

5 of these traits to a greater or lesser degree. For example, two individuals could

be described as agreeable (agreeable people value getting along with others). But

there could be significant variation in the degree to which they are both agree-

able. In other words, all 5 personality traits exist on a continuum rather than as

attributes that a person does or does not have.

Each of the Big Five personality traits is made up of 6 facets or sub traits. These

can be assessed independently of the trait that they belong to.

• Extraversion

Extraversion is marked by pronounced engagement with the external world.

Extraverts enjoy being with people, are full of energy, and often experience

positive emotions. They tend to be enthusiastic, action-oriented, individu-

als who are likely to say “Yes!” or “Let’s go!” to opportunities for excite-

ment. In groups they like to talk, assert themselves, and draw attention to

themselves. Introverts lack the exuberance, energy, and activity levels of

extraverts. They tend to be quiet, low-key, deliberate, and disengaged from

the social world. Their lack of social involvement should not be interpreted

as shyness or depression; the introvert simply needs less stimulation than

an extravert and prefers to be alone. The independence and reserve of the

introvert is sometimes mistaken as unfriendliness or arrogance. In reality,

an introvert who scores high on the agreeableness dimension will not seek

others out but will be quite pleasant when approached.

Extraversion Facets:

– Friendliness. Friendly people genuinely like other people and openly

demonstrate positive feelings toward others. They make friends quickly

and it is easy for them to form close, intimate relationships. Low scor-

ers on Friendliness are not necessarily cold and hostile, but they do

not reach out to others and are perceived as distant and reserved.

– Gregariousness. Gregarious people find the company of others pleas-

antly stimulating and rewarding. They enjoy the excitement of crowds.

Low scorers tend to feel overwhelmed by, and therefore actively avoid,

large crowds. They do not necessarily dislike being with people some-

times, but their need for privacy and time to themselves is much greater

than for individuals who score high on this scale.

– Assertiveness. High scorers on Assertiveness like to speak out, take charge,

and direct the activities of others. They tend to be leaders in groups.

Low scorers tend not to talk much and let others control the activities

of groups.

– Activity Level. Active individuals lead fast-paced, busy lives. They

move about quickly, energetically, and vigorously, and they are in-

volved in many activities. People who score low on this scale follow a

slower and more leisurely, relaxed pace.

– Excitement-Seeking. High scorers on this scale are easily bored with-

out high levels of stimulation. They love bright lights and hustle and

bustle. They are likely to take risks and seek thrills. Low scorers are

overwhelmed by noise and commotion and are averse to thrill-seeking.

– Cheerfulness. This scale measures positive mood and feelings, not neg-

ative emotions (which are a part of the Neuroticism domain). Persons

who score high on this scale typically experience a range of positive

feelings, including happiness, enthusiasm, optimism, and joy. Low

scorers are not as prone to such energetic, high spirits.

• Agreeableness

Agreeableness reflects individual differences in concern with cooperation

and social harmony. Agreeable individuals value getting along with others.

They are therefore considerate, friendly, generous, helpful, and willing to

compromise their interests with others’. Agreeable people also have an op-

timistic view of human nature. They believe people are basically honest,

decent, and trustworthy. Disagreeable individuals place self-interest above

getting along with others. They are generally unconcerned with others’

well-being, and therefore are unlikely to extend themselves for other peo-

ple. Sometimes their skepticism about others’ motives causes them to be

suspicious, unfriendly, and uncooperative. Agreeableness is obviously ad-

vantageous for attaining and maintaining popularity. Agreeable people are

better liked than disagreeable people. On the other hand, agreeableness is

not useful in situations that require tough or absolutely objective decisions.

Disagreeable people can make excellent scientists, critics, or soldiers.

Agreeableness Facets:

– Trust. A person with high trust assumes that most people are fair,

honest, and have good intentions. Persons low in trust may see others

as selfish, devious, and potentially dangerous.

– Morality. High scorers on this scale see no need for pretence or ma-

nipulation when dealing with others and are therefore candid, frank,

and sincere. Low scorers believe that a certain amount of deception in

social relationships is necessary. People find it relatively easy to relate

to the straightforward high-scorers on this scale. They generally find

it more difficult to relate to the low-scorers on this scale. It should be

made clear that low scorers are not unprincipled or immoral; they are

simply more guarded and less willing to openly reveal the whole truth.

– Altruism. Altruistic people find helping other people genuinely re-

warding. Consequently, they are generally willing to assist those who

are in need. Altruistic people find that doing things for others is a

form of self-fulfillment rather than self-sacrifice. Low scorers on this

scale do not particularly like helping those in need. Requests for help

feel like an imposition rather than an opportunity for self-fulfillment.

– Cooperation. Individuals who score high on this scale dislike con-

frontations. They are perfectly willing to compromise or to deny their

own needs in order to get along with others. Those who score low on

this scale are more likely to intimidate others to get their way.

– Modesty. High scorers on this scale do not like to claim that they are

better than other people. In some cases this attitude may derive from

low self-confidence or self-esteem. Nonetheless, some people with high

self-esteem find immodesty unseemly. Those who are willing to de-

scribe themselves as superior tend to be seen as disagreeably arrogant

by other people.

– Sympathy. People who score high on this scale are tender-hearted and

compassionate. They feel the pain of others vicariously and are easily

moved to pity. Low scorers are not affected strongly by human suf-

fering. They pride themselves on making objective judgments based

on reason. They are more concerned with truth and impartial justice

than with mercy.

• Conscientiousness

Conscientiousness concerns the way in which we control, regulate, and direct

our impulses. Impulses are not inherently bad; occasionally time constraints

require a snap decision, and acting on our first impulse can be an effective

response. Also, in times of play rather than work, acting spontaneously

and impulsively can be fun. Impulsive individuals can be seen by others as

colorful and fun-to-be-with.

Nonetheless, acting on impulse can lead to trouble in a number of ways.

Some impulses are antisocial. Uncontrolled antisocial acts not only harm

other members of society, but also can result in retribution toward the

perpetrator of such impulsive acts. Another problem with impulsive acts is

that they often produce immediate rewards but undesirable, long-term con-

sequences. Examples include excessive socializing that leads to being fired

from one’s job, hurling an insult that causes the breakup of an important

relationship, or using pleasure-inducing drugs that eventually destroy one’s

health.

Impulsive behavior, even when not seriously destructive, diminishes a per-

son’s effectiveness in significant ways. Acting impulsively disallows con-

templating alternative courses of action, some of which would have been

wiser than the impulsive choice. Impulsivity also sidetracks people during

projects that require organized sequences of steps or stages. Accomplish-

ments of an impulsive person are therefore small, scattered, and inconsis-

tent.

A hallmark of intelligence, what potentially separates human beings from

earlier life forms, is the ability to think about future consequences before

acting on an impulse. Intelligent activity involves contemplation of long-

range goals, organizing and planning routes to these goals, and persisting

toward one’s goals in the face of short-lived impulses to the contrary. The

idea that intelligence involves impulse control is nicely captured by the term

prudence, an alternative label for the Conscientiousness domain. Prudent

means both wise and cautious. Persons who score high on the Conscien-

tiousness scale are, in fact, perceived by others as intelligent.

The benefits of high conscientiousness are obvious. Conscientious individ-

uals avoid trouble and achieve high levels of success through purposeful

planning and persistence. They are also positively regarded by others as

intelligent and reliable. On the negative side, they can be compulsive perfec-

tionists and workaholics. Furthermore, extremely conscientious individuals

might be regarded as stuffy and boring. Unconscientious people may be

criticized for their unreliability, lack of ambition, and failure to stay within

the lines, but they will experience many short-lived pleasures and they will

never be called stuffy.

Conscientiousness Facets:

– Self-Efficacy. Self-Efficacy describes confidence in one’s ability to ac-

complish things. High scorers believe they have the intelligence (com-

mon sense), drive, and self-control necessary for achieving success. Low

scorers do not feel effective, and may have a sense that they are not in

control of their lives.

– Orderliness. Persons with high scores on orderliness are well-organized.

They like to live according to routines and schedules. They keep lists

and make plans. Low scorers tend to be disorganized and scattered.

– Dutifulness. This scale reflects the strength of a person’s sense of duty

and obligation. Those who score high on this scale have a strong sense

of moral obligation. Low scorers find contracts, rules, and regulations

overly confining. They are likely to be seen as unreliable or even

irresponsible.

– Achievement-Striving. Individuals who score high on this scale strive

hard to achieve excellence. Their drive to be recognized as successful

keeps them on track toward their lofty goals. They often have a strong

sense of direction in life, but extremely high scores may be too single-

minded and obsessed with their work. Low scorers are content to get

by with a minimal amount of work, and might be seen by others as

lazy.

– Self-Discipline. What many people call will-power refers to the ability

to persist at difficult or unpleasant tasks until they are completed.

People who possess high self-discipline are able to overcome reluctance

to begin tasks and stay on track despite distractions. Those with low

self-discipline procrastinate and show poor follow-through, often failing

to complete tasks, even tasks they want very much to complete.

– Cautiousness. Cautiousness describes the disposition to think through

possibilities before acting. High scorers on the Cautiousness scale take

their time when making decisions. Low scorers often say or do the first

thing that comes to mind without deliberating alternatives and the

probable consequences of those alternatives.

• Neuroticism

The term neurosis is used to describe a condition marked by mental distress,

emotional suffering, and an inability to cope effectively with the normal de-

mands of life. It is suggested that everyone shows some signs of neurosis,

but that we differ in our degree of suffering and our specific symptoms of

distress. Today neuroticism refers to the tendency to experience negative

feelings. Those who score high on Neuroticism may experience primarily

one specific negative feeling such as anxiety, anger, or depression, but are

likely to experience several of these emotions. People high in neuroticism

are emotionally reactive. They respond emotionally to events that would

not affect most people, and their reactions tend to be more intense than

normal. They are more likely to interpret ordinary situations as threaten-

ing, and minor frustrations as hopelessly difficult. Their negative emotional

reactions tend to persist for unusually long periods of time, which means

they are often in a bad mood. These problems in emotional regulation can

diminish a neurotic’s ability to think clearly, make decisions, and cope ef-

fectively with stress.

At the other end of the scale, individuals who score low in neuroticism are

less easily upset and are less emotionally reactive. They tend to be calm,

emotionally stable, and free from persistent negative feelings. Freedom from

negative feelings does not mean that low scorers experience a lot of positive

feelings; frequency of positive emotions is a component of the Extraversion

domain.

Neuroticism Facets:

– Anxiety. The “fight-or-flight” system of the brain of anxious individuals

is too easily and too often engaged. Therefore, people who are

high in anxiety often feel like something dangerous is about to happen.

They may be afraid of specific situations or be just generally fearful.

They feel tense, jittery, and nervous.

– Anger. Persons who score high in Anger feel enraged when things do

not go their way. They are sensitive about being treated fairly and

feel resentful and bitter when they feel they are being cheated. This

scale measures the tendency to feel angry; whether or not the person

expresses annoyance and hostility depends on the individual’s level on

Agreeableness. Low scorers do not get angry often or easily.

– Depression. This scale measures the tendency to feel sad, dejected,

and discouraged. High scorers lack energy and have difficulty initiating

activities. Low scorers tend to be free from these depressive feelings.

– Self-Consciousness. Self-conscious individuals are sensitive about what

others think of them. Their concern about rejection and ridicule causes

them to feel shy and uncomfortable around others. They are easily

embarrassed and often feel ashamed. Their fears that others will

criticize or make fun of them are exaggerated and unrealistic, but

their awkwardness and discomfort may make these fears a self-fulfilling

prophecy. Low scorers, in contrast, do not suffer from the mistaken

impression that everyone is watching and judging them. They do not

feel nervous in social situations.

– Immoderation. Immoderate individuals feel strong cravings and urges

that they have difficulty resisting. They tend to be oriented toward

short-term pleasures and rewards rather than long-term consequences.

Low scorers do not experience strong, irresistible cravings and conse-

quently do not find themselves tempted to overindulge.

– Vulnerability. High scorers on Vulnerability experience panic, confusion,

and helplessness when under pressure or stress. Low scorers feel

more poised, confident, and clear-thinking when stressed.

• Openness to Experience

Openness to Experience describes a dimension of cognitive style that dis-

tinguishes imaginative, creative people from down-to-earth, conventional

people. Open people are intellectually curious, appreciative of art, and

sensitive to beauty. They tend to be, compared to closed people, more

aware of their feelings. They tend to think and act in individualistic and

nonconforming ways. Intellectuals typically score high on Openness to Ex-

perience; consequently, this factor has also been called Culture or Intellect.

Nonetheless, Intellect is probably best regarded as one aspect of openness

to experience. Scores on Openness to Experience are only modestly related

to years of education and scores on standard intelligence tests.

Another characteristic of the open cognitive style is a facility for thinking in

symbols and abstractions far removed from concrete experience. Depend-

ing on the individual’s specific intellectual abilities, this symbolic cognition

may take the form of mathematical, logical, or geometric thinking, artistic

and metaphorical use of language, music composition or performance, or

one of the many visual or performing arts. People with low scores on open-

ness to experience tend to have narrow, common interests. They prefer the

plain, straightforward, and obvious over the complex, ambiguous, and sub-

tle. They may regard the arts and sciences with suspicion, regarding these

endeavors as abstruse or of no practical use. Closed people prefer familiar-

ity over novelty; they are conservative and resistant to change. Openness

is often presented as healthier or more mature by psychologists, who are

often themselves open to experience. However, open and closed styles of

thinking are useful in different environments. The intellectual style of the

open person may serve a professor well, but research has shown that closed

thinking is related to superior job performance in police work, sales, and a

number of service occupations.

Openness to Experience Facets:

– Imagination. To imaginative individuals, the real world is often too

plain and ordinary. High scorers on this scale use fantasy as a way of

creating a richer, more interesting world. Low scorers on this scale

are more oriented to facts than fantasy.

– Artistic Interests. High scorers on this scale love beauty, both in art

and in nature. They become easily involved and absorbed in artistic

and natural events. They are not necessarily artistically trained or

talented, although many will be. The defining features of this scale

are interest in, and appreciation of natural and artificial beauty. Low

scorers lack aesthetic sensitivity and interest in the arts.

– Emotionality. Persons high on Emotionality have good access to and

awareness of their own feelings. Low scorers are less aware of their

feelings and tend not to express their emotions openly.

– Adventurousness. High scorers on adventurousness are eager to try

new activities, travel to foreign lands, and experience different things.

They find familiarity and routine boring, and will take a new route

home just because it is different. Low scorers tend to feel uncomfort-

able with change and prefer familiar routines.

– Intellect. Intellect and artistic interests are the two most important,

central aspects of openness to experience. High scorers on Intellect love

to play with ideas. They are open-minded to new and unusual ideas,

and like to debate intellectual issues. They enjoy riddles, puzzles, and

brain teasers. Low scorers on Intellect prefer dealing with people or

things rather than ideas. They regard intellectual exercises as a waste

of time. Intellect should not be equated with intelligence. Intellect

is an intellectual style, not an intellectual ability, although high scor-

ers on Intellect score slightly higher than low-Intellect individuals on

standardized intelligence tests.

– Liberalism. Psychological liberalism refers to a readiness to challenge

authority, convention, and traditional values. In its most extreme

form, psychological liberalism can even represent outright hostility to-

ward rules, sympathy for law-breakers, and love of ambiguity, chaos,

and disorder. Psychological conservatives prefer the security and sta-

bility brought by conformity to tradition. Psychological liberalism and

conservatism are not identical to political affiliation, but certainly in-

cline individuals toward certain political parties.

It is possible, although unusual, to score high in one or more facets of a per-

sonality trait and low in other facets of the same trait. For example, you could

score highly in Imagination, Artistic Interests, Emotionality and Adventurous-

ness, but score low in Intellect and Liberalism.

Chapter 3

Research Questions

The main objective of this paper is to build a model of a user's virtual behavior
by analyzing his or her presence on online social networks (OSNs) and to recommend
products to the user on the basis of that model. To reach this goal, we pursue
several sub-objectives: collecting the user's social network activity, analyzing
that activity over a period of days, categorizing it according to the Big Five
factors, and recommending services or products to the user on the basis of the
resulting behavior model.

These objectives raise several research questions. The main research question of
this paper is: How can OSN users be categorized according to the Big Five factors
from their behavior on the OSN? The sub-questions are:

1. How do OSNs (Online Social Networks) represent a user?

2. How can we analyze user behavior?

3. How can user behavior be categorized according to the Big Five factors?

Chapter 4

Proposed Research Methodology

In this paper our aim is to relate a text corpus drawn from a social network to
the psychological theory of personality. We also attempt to implement a
recommendation system based on behavior analysis. We therefore use correlational
and exploratory methodologies, in which the concept is the behavior indicator in
Big Five modeling and the variables are Extraversion, Neuroticism, Agreeableness,
Openness and Conscientiousness.

• 4.1 Data Collection: To categorize a user's behavior, a large volume of data
is collected from an OSN (Twitter). The data are generated by the user's
activities on the OSN, such as posts by the user, posts by the user's friends,
and liked pages. Only public data are collected, so there is no barrier to
using them. For each user, the previous 30 days of data are collected at a
time. The system collects the data directly from the OSN with the user's full
authorization and then stores them securely in the system database.

Figure 4.1. Modeling User Behavior (diagram labels: User, OSN (Twitter),
Twitter API, LIWC, Mapping, Represents)

Twitter, a social network site, is well suited to sentiment analysis because
its users create a very large number of short messages [60], so we used
Twitter to collect users' data. Using the Twitter REST API v1.1, we collected
public tweets and retweets. Our Twitter app requires users to authorize it
before it extracts data from their profiles, and it does not collect data if a
user does not allow it to run. We made sure that all data we extract from
Twitter are public. By calling the GET statuses/user_timeline and
GET statuses/retweets_of_me methods we can collect the user's tweets and
retweets. The system can also collect public data from the profiles that the
user currently follows by using the GET friends/ids method. The collected data
are in JSON format, and our Twitter app writes them to text files. Because
separate files are easier to work with, we keep one data file per user, named
by the user's unique identifier (user ID or username).
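The storage step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes the timeline has already been downloaded as a JSON array of v1.1 tweet objects (with their `created_at`, `text` and `user.id_str` fields), and the function names and the one-file-per-user layout are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Timestamp layout of the "created_at" field in v1.1 tweet objects,
# e.g. "Mon Jun 23 10:00:00 +0000 2014".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_timeline(raw_json):
    """Decode a JSON array of tweet objects (already fetched elsewhere)."""
    return json.loads(raw_json)


def recent_tweets(tweets, now, days=30):
    """Keep only the tweets created within `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return [
        t for t in tweets
        if datetime.strptime(t["created_at"], CREATED_AT_FORMAT) >= cutoff
    ]


def write_user_files(tweets, out_dir):
    """Append each tweet's text to a file named after its author's id.

    The "<user id>.txt" naming is an assumption made for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for t in tweets:
        user_id = t["user"]["id_str"]
        path = out_dir / f"{user_id}.txt"
        with path.open("a", encoding="utf-8") as fh:
            # One tweet per line, so line breaks inside a tweet are flattened.
            fh.write(t["text"].replace("\n", " ") + "\n")
        paths[user_id] = path
    return paths
```

Passing `now` explicitly (as a timezone-aware `datetime`) keeps the 30-day window reproducible when the same data are analyzed again later.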

• 4.2 Data Analysis: The text file containing a single user's past data is
analyzed with LIWC (Linguistic Inquiry and Word Count), a text analysis
program designed by James W. Pennebaker, Roger J. Booth and Martha E. Francis.
Each text file analyzed by LIWC2007 can be treated as a whole or broken into
segments. LIWC counts the words in the text that match each category of its
dictionary. When processing finishes, it saves the results to a specified
output file, with each value written under its corresponding category. These
categories indicate different aspects of the Big Five factors, and the
modeling is built on these results. Table 4.1 shows which categories belong to
which factor.

Table 4.1. Relationship between LIWC categories and Big Five factors

Big Five factors          LIWC Categories
Extraversion              Social process, Family, Friends, Humans, Affective,
                          Biological process, Sexual, Achievement
Openness to Experience    Leisure, Insight, Body, Ingestion
Neuroticism               Swear words, Negation, Negative emotion, Anger,
                          Sadness, Sexual
Conscientiousness         Relativity, Motion, Space, Time, Religion, Death,
                          Money, Certainty
Agreeableness             Positive Emotion, Feel, Discrepancy (would),
                          Tentative (maybe), Hear

LIWC processes the collected data sentence by sentence and, according to the
meaning and use of each word, assigns a percentage score to each category.
The percentages of the categories belonging to each Big Five factor are then
summed, and the factor with the highest total is taken as the user's behavior.
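The mapping and selection step can be sketched as below. The dictionary is transcribed from Table 4.1; the keys are the labels used in that table, which are an assumption here and need not match the column names in an actual LIWC output file. The function names are hypothetical.

```python
# Category-to-factor mapping transcribed from Table 4.1. "Sexual" appears
# under both Extraversion and Neuroticism, as it does in the table.
FACTOR_CATEGORIES = {
    "Extraversion": [
        "Social process", "Family", "Friends", "Humans",
        "Affective", "Biological process", "Sexual", "Achievement",
    ],
    "Openness to Experience": ["Leisure", "Insight", "Body", "Ingestion"],
    "Neuroticism": [
        "Swear words", "Negation", "Negative emotion",
        "Anger", "Sadness", "Sexual",
    ],
    "Conscientiousness": [
        "Relativity", "Motion", "Space", "Time",
        "Religion", "Death", "Money", "Certainty",
    ],
    "Agreeableness": [
        "Positive Emotion", "Feel", "Discrepancy", "Tentative", "Hear",
    ],
}


def factor_scores(category_percentages):
    """Sum the LIWC percentages of the categories assigned to each factor.

    Categories missing from the input count as 0.
    """
    return {
        factor: sum(category_percentages.get(c, 0.0) for c in categories)
        for factor, categories in FACTOR_CATEGORIES.items()
    }


def dominant_factor(category_percentages):
    """Return the factor with the highest summed percentage."""
    scores = factor_scores(category_percentages)
    return max(scores, key=scores.get)
```

For instance, a user whose LIWC output is dominated by the Family and Friends categories would be labeled Extraversion by `dominant_factor`. Ties are broken by dictionary order, which is an arbitrary choice of this sketch.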

• 4.3 Results: LIWC reports the number of words counted in each category as a
percentage of the total:

\[ \text{result} = \frac{TC \times 100}{WC} \]

where WC is the total number of words in the text file and TC is the total
number of words in the category. Inverting this formula gives the exact number
of words in a category:

\[ TC = \frac{\text{result} \times WC}{100} \]

The values of the categories that lie in the same Big Five factor are then
summed. We refer to this as the linear regression formula; it is a linear
combination in which every coefficient equals 1:

\[ f(X) = X_1 + X_2 + X_3 + \dots + X_i \]

We then express the value of each factor as a percentage, using the proportion

\[ \frac{\text{part}}{\text{whole}} = \frac{\%}{100} \]

These percentages are used to draw the pie chart in Excel.
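The arithmetic of this section can be checked with a short sketch. It assumes that the "whole" in the percentage formula is the total of the five factor sums, so that the pie-chart slices add up to 100; the function names are hypothetical.

```python
def category_percentage(tc, wc):
    """result = (TC * 100) / WC, the percentage LIWC reports for a category."""
    return tc * 100 / wc


def category_count(result, wc):
    """Invert the percentage: TC = (result * WC) / 100."""
    return result * wc / 100


def factor_shares(factor_sums):
    """Rescale the factor sums so that they add up to 100 (pie-chart slices).

    Assumes "whole" means the total over all factors.
    """
    total = sum(factor_sums.values())
    if total == 0:
        return {factor: 0.0 for factor in factor_sums}
    return {factor: value * 100 / total for factor, value in factor_sums.items()}
```

As a worked case, a category with 12 matching words in a 400-word file gives a result of 12 × 100 / 400 = 3, and inverting gives 3 × 400 / 100 = 12 words again.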

Example:

Figure 4.2. Pie Chart of LIWC Results

Figure 4.3. Personality Based Recommendation System

• 4.4 Recommendation Analysis: Based on the behavior analysis, brands or
products are recommended to users. The dominant factor in a user's behavior
can incline the user toward a particular type of product. The tables below
give examples in which the majority of people with a particular dominant
factor are interested in a particular brand, product or service.

For example, suppose users A, B and C follow the Age of Empires game page on
Twitter. After analyzing their tweets and retweets, the system maps their
behavior and finds that extraversion is the dominant factor for each of them.
If the system then analyzes the tweets and retweets of user X and finds that
his behavior is also dominated by extraversion, we can recommend games like
Age of Empires to him.
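The reasoning in this example can be sketched as a simple majority rule. This is an illustration under stated assumptions rather than the system's actual logic: each follower is represented only by his or her dominant factor, a page is associated with the most common factor among its followers, and the function names are hypothetical.

```python
from collections import Counter


def page_factor(follower_factors):
    """Return the most common dominant factor among a page's followers."""
    return Counter(follower_factors).most_common(1)[0][0]


def recommend(user_factor, page_followers):
    """List the pages whose followers mostly share the user's dominant factor.

    `page_followers` maps a page name to the dominant factors of its followers.
    """
    return sorted(
        page
        for page, factors in page_followers.items()
        if factors and page_factor(factors) == user_factor
    )


# Followers A, B and C of the game page are all dominated by extraversion.
pages = {
    "Age of Empires": ["Extraversion", "Extraversion", "Extraversion"],
    "Chess": ["Conscientiousness", "Extraversion", "Conscientiousness"],
}
suggestions = recommend("Extraversion", pages)
```

Here `suggestions` contains only "Age of Empires", because the hypothetical Chess page is followed mostly by users dominated by conscientiousness.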

Table 4.2. Products under Big Five factors

Big Five Factors          Product Categories/Brands: Video Games
Extraversion              Strategy (Age of Empires, Commandos)
Openness to Experience    Racing (Need for Speed)
Neuroticism               Shooting (Call of Duty, Counter Strike)
Conscientiousness         Chess, Sudoku
Agreeableness             Sports (FIFA)

Table 4.3. Products under Big Five factors

Big Five Factors          Product Categories/Brands: Movies
Extraversion              Political, Fantasy, Family
Openness to Experience    Comedy, Sports, Drama
Neuroticism               Crime scene, Action, Horror
Conscientiousness         Political, Historical, Conspiracy
Agreeableness             Romantic, Drama

Table 4.4. Products under Big Five factors

Big Five Factors          Product Categories/Brands: Music
Extraversion              Rock
Openness to Experience    Classical, Vocal, Country wood
Neuroticism               Pop, Heavy Metal
Conscientiousness         New Released, Historic
Agreeableness             Romantic, Country

Table 4.5. Products under Big Five factors

Big Five Factors          Product Categories/Brands: Food
Extraversion              Bead, Meat
Openness to Experience    Multicultural Food, Pizza
Neuroticism               Fast Food
Conscientiousness         Salad, Vegetable
Agreeableness             Bread, Cheese

Table 4.6. Products under Big Five factors

Big Five Factors          Product Categories/Brands: Beverage
Extraversion              Coffee, Tea
Openness to Experience    Milkshake, Green Tea
Neuroticism               Soft Drinks
Conscientiousness         Green Tea, Black Coffee
Agreeableness             Coffee, Tea, Soft Drinks

Table 4.7. Products under Big Five factors

Big Five Factors          Product Categories/Brands: Sports
Extraversion              Football, Athletics
Openness to Experience    Cricket, Swimming
Neuroticism               Boxing, Rugby, Martial arts
Conscientiousness         Athletics, Martial arts
Agreeableness             Gymnastics

Chapter 5

Conclusions

In this thesis we showed that personality recognition can be automated by
analyzing language cues. Little work has been done in this field, and to the
best of our knowledge our research is among the first to examine personality
recognition and to introduce a recommendation system based on the results of
sentiment analysis. During our research we realized that feature selection is
one of the most important tasks, as some of the best models contain only a
small subset of the full feature set.

LIWC features are beneficial for all traits. For all recognition tasks we
analyzed the influence of the most relevant individual features in specific
models. We also used the Stanford NLP (natural language processing) application
to analyze and split the texts. Later we used only LIWC because it generated
more accurate results than Stanford NLP for our data analysis.

At present our system can only use text information; in the future it will be
able to analyze data from shared links or videos. Our system cannot identify
quotations (which users use to share others' speech). It also lacks the
ability to understand double negatives in a sentence, for example: “The
service of Samsung Galaxy S3 is not very bad”.
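This limitation follows from counting words independently of their context. The toy counter below illustrates it; it is not LIWC, and its two word lists are tiny hypothetical stand-ins for the much larger LIWC dictionary.

```python
import re

# Hypothetical miniature word lists, for illustration only.
NEGATIVE_EMOTION = {"bad", "awful", "hate"}
NEGATION = {"not", "no", "never"}


def count_categories(text):
    """Count dictionary matches word by word, ignoring word order and scope."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        "word_count": len(words),
        "Negative emotion": sum(w in NEGATIVE_EMOTION for w in words),
        "Negation": sum(w in NEGATION for w in words),
    }


counts = count_categories("The service of Samsung Galaxy S3 is not very bad")
```

For this sentence the counter registers one negation word and one negative-emotion word. Because each word is scored on its own, "bad" still adds to the negative-emotion total even though "not very bad" expresses a mildly positive opinion.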

There is considerable scope for analyzing exclamatory sentences and smileys
(sentimental expressions). Our system cannot understand sarcasm at present.
The accuracy of brand recommendations depends on how precisely the percentages
of the Big Five factors are measured; with greater depth of measurement and a
finer marking scale, the system will become more efficient.

Bibliography

[1] K. Cherry, “The big five personality dimensions,” 2012. Accessed: 2010-09-

30.

[2] “Facebook.com.” Accessed: 2014-06-01.

[3] “Twitter.com.” Accessed: 2014-06-01.

[4] J. Bao, Y. Zheng, and M. F. Mokbel, “Location-based and preference-aware

recommendation using sparse geo-social networking data,” in Proceedings of

the 20th International Conference on Advances in Geographic Information

Systems, pp. 199–208, ACM, 2012.

[5] A. M. Ferman, J. H. Errico, P. v. Beek, and M. I. Sezan, “Content-based

filtering and personalization using structured metadata,” in Proceedings of

the 2nd ACM/IEEE-CS joint conference on Digital libraries, pp. 393–393,

ACM, 2002.

[6] “Amazon.com.” Accessed: 2014-04-01.

[7] “Netflix.com.” Accessed: 2014-04-01.

[8] F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida, “Characterizing user

behavior in online social networks,” in Proceedings of the 9th ACM SIG-

COMM conference on Internet measurement conference, pp. 49–62, ACM,

2009.

[9] N. O. Report, “Social networks and blogs now 4th most popular online ac-

tivity.”

[10] Y. Zheng, “Location-based social networks: Users,” in Computing with Spa-

tial Trajectories, pp. 243–276, Springer, 2011.

[11] “Flickr.com.” Accessed: 2014-04-01.

[12] “Foursquare.com.” Accessed: 2014-01-01.

[13] X. Cao, G. Cong, and C. S. Jensen, “Mining significant semantic loca-

tions from gps data,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2,

pp. 1009–1020, 2010.

[14] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma, “Mining interesting locations

and travel sequences from gps trajectories,” in Proceedings of the 18th inter-

national conference on World wide web, pp. 791–800, ACM, 2009.

[15] Q. Li, Y. Zheng, X. Xie, Y. Chen, W. Liu, and W.-Y. Ma, “Mining user sim-

ilarity based on location history,” in Proceedings of the 16th ACM SIGSPA-

TIAL international conference on Advances in geographic information sys-

tems, p. 34, ACM, 2008.

[16] X. Xiao, Y. Zheng, Q. Luo, and X. Xie, “Finding similar users using category-

based location history,” in Proceedings of the 18th SIGSPATIAL Interna-

tional Conference on Advances in Geographic Information Systems, pp. 442–

445, ACM, 2010.

[17] W. Liu, Y. Zheng, S. Chawla, J. Yuan, and X. Xing, “Discovering spatio-

temporal causal interactions in traffic data streams,” in Proceedings of the

17th ACM SIGKDD international conference on Knowledge discovery and

data mining, pp. 1010–1018, ACM, 2011.

[18] Y. Zheng, Q. Li, Y. Chen, X. Xie, and W.-Y. Ma, “Understanding mobility

based on gps data,” in Proceedings of the 10th international conference on

Ubiquitous computing, pp. 312–321, ACM, 2008.

[19] L. Wang, Y. Zheng, X. Xie, and W.-Y. Ma, “A flexible spatio-temporal

indexing scheme for large-scale gps track retrieval,” in Mobile Data Man-

agement, 2008. MDM’08. 9th International Conference on, pp. 1–8, IEEE,

2008.

[20] I. Konstas, V. Stathopoulos, and J. M. Jose, “On social networks and col-

laborative recommendation,” in Proceedings of the 32nd international ACM

SIGIR conference on Research and development in information retrieval,

pp. 195–202, ACM, 2009.

[21] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl, “An algorithmic

framework for performing collaborative filtering,” in Proceedings of the 22nd

annual international ACM SIGIR conference on Research and development

in information retrieval, pp. 230–237, ACM, 1999.

[22] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recom-

mender systems: A survey of the state-of-the-art and possible extensions,”

Knowledge and Data Engineering, IEEE Transactions on, vol. 17, no. 6,

pp. 734–749, 2005.

[23] H. Yildirim and M. S. Krishnamoorthy, “A random walk method for allevi-

ating the sparsity problem in collaborative filtering,” in Proceedings of the

2008 ACM conference on Recommender systems, pp. 131–138, ACM, 2008.

[24] G. Das, N. Koudas, M. Papagelis, and S. Puttaswamy, “Efficient sampling of

information in social networks,” in Proceedings of the 2008 ACM workshop

on Search in social media, pp. 67–74, ACM, 2008.

[25] H. Halpin, V. Robu, and H. Shepherd, “The complex dynamics of collabora-

tive tagging,” in Proceedings of the 16th international conference on World

Wide Web, pp. 211–220, ACM, 2007.

[26] S. B. Subramanya and H. Liu, “Socialtagger-collaborative tagging for blogs

in the long tail,” in Proceedings of the 2008 ACM workshop on Search in

social media, pp. 19–26, ACM, 2008.

[27] M. Strohmaier, “Purpose tagging: capturing user intent to assist goal-

oriented social search,” in Proceedings of the 2008 ACM workshop on Search

in social media, pp. 35–42, ACM, 2008.

[28] N. Craswell and M. Szummer, “Random walks on the click graph,” in Pro-

ceedings of the 30th annual international ACM SIGIR conference on Re-

search and development in information retrieval, pp. 239–246, ACM, 2007.

[29] M. Clements, A. P. de Vries, and M. J. Reinders, “Optimizing single term

queries using a personalized markov random walk over the social graph,”

in Workshop on Exploiting Semantic Annotations in Information Retrieval

(ESAIR), 2008.

[30] A. Hotho, R. Jaschke, C. Schmitz, and G. Stumme, Information retrieval in

folksonomies: Search and ranking. Springer, 2006.

[31] G. Paltoglou, S. Gobron, M. Skowron, M. Thelwall, and D. Thalmann, “Sen-

timent analysis of informal textual communication in cyberspace,” Proc. En-

gage, pp. 13–25, 2010.

[32] “Avatarmovie.com.” Accessed: 2014-04-01.

[33] A. Kappas, U. Hess, and K. R. Scherer, “6. voice and emotion,” Fundamen-

tals of nonverbal behavior, p. 200, 1991.

[34] P. Becheiraz and D. Thalmann, “A model of nonverbal communication

and interpersonal relationship between virtual actors,” in Computer Ani-

mation’96. Proceedings, pp. 58–67, IEEE, 1996.

[35] S. Gobron, J. Ahn, G. Paltoglou, M. Thelwall, and D. Thalmann, “From sen-

tence to emotion: a real-time three-dimensional graphics metaphor of emo-

tions extracted from text,” The Visual Computer, vol. 26, no. 6-8, pp. 505–

519, 2010.

[36] M. Skowron, “Affect listeners: Acquisition of affective states by means of

conversational systems,” in Development of Multimodal Interfaces: Active

Listening and Synchrony, pp. 169–181, Springer, 2010.

[37] M. Thelwall and D. Wilkinson, “Public dialogs in social network sites: What

is their purpose?,” Journal of the American Society for Information Science

and Technology, vol. 61, no. 2, pp. 392–404, 2010.

[38] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classi-

fication using machine learning techniques,” in Proceedings of the ACL-02

conference on Empirical methods in natural language processing-Volume 10,

pp. 79–86, Association for Computational Linguistics, 2002.

[39] M. Thomas, B. Pang, and L. Lee, “Get out the vote: Determining support

or opposition from congressional floor-debate transcripts,” in Proceedings of

the 2006 conference on empirical methods in natural language processing,

pp. 327–335, Association for Computational Linguistics, 2006.

[40] I. Ounis, C. Macdonald, and I. Soboroff, “Overview of the trec-2008 blog

track,” tech. rep., DTIC Document, 2008.

[41] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Foundations

and trends in information retrieval, vol. 2, no. 1-2, pp. 1–135, 2008.

[42] T. Mullen and N. Collier, “Sentiment analysis using support vector machines

with diverse information sources.,” in EMNLP, vol. 4, pp. 412–418, 2004.

[43] C. Whitelaw, N. Garg, and S. Argamon, “Using appraisal groups for senti-

ment analysis,” in Proceedings of the 14th ACM international conference on

Information and knowledge management, pp. 625–631, ACM, 2005.

[44] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity in

phrase-level sentiment analysis,” in Proceedings of the conference on human

language technology and empirical methods in natural language processing,

pp. 347–354, Association for Computational Linguistics, 2005.

[45] J. W. Pennebaker, M. E. Francis, and R. J. Booth, “Linguistic inquiry and
word count: LIWC 2001,” Mahwah, NJ: Lawrence Erlbaum Associates, 2001.

[46] M. Bradley and P. Lang, “Affective norms for english words (anew): Techni-

cal manual and affective ratings,” Gainesville, FL: The Center for Research

in Psychophysiology, University of Florida, 1999.

[47] J. Brooke, M. Tofiloski, and M. Taboada, “Cross-linguistic sentiment analy-

sis: From english to spanish.,” in RANLP, pp. 50–54, 2009.

[48] R. B. Slatcher, C. K. Chung, J. W. Pennebaker, and L. D. Stone, “Winning

words: Individual differences in linguistic style among us presidential and

vice presidential candidates,” Journal of Research in Personality, vol. 41,

no. 1, pp. 63–75, 2007.

[49] K. M. Colby, S. Weber, and F. D. Hilf, “Artificial paranoia,” Artificial In-

telligence, vol. 2, no. 1, pp. 1–25, 1971.

[50] F. Barthelemy, B. Dosquet, S. Gries, and X. Magnant, “Believable synthetic

characters in a virtual emarket,” in Artificial Intelligence and Applications:

IASTED International Conference Proceedings, as part of the 22 nd IASTED

International Multi-Conference on Applied Informatics, 2004.

[51] J. Bates et al., “The role of emotion in believable agents,” Communications

of the ACM, vol. 37, no. 7, pp. 122–125, 1994.

[52] J. C. Acosta, “Using emotion to gain rapport in a spoken dialog system,”

in Proceedings of Human Language Technologies: The 2009 Annual Confer-

ence of the North American Chapter of the Association for Computational

Linguistics, Companion Volume: Student Research Workshop and Doctoral

Consortium, pp. 49–54, Association for Computational Linguistics, 2009.

[53] J. Gratch, N. Wang, J. Gerten, E. Fast, and R. Duffy, “Creating rapport

with virtual agents,” in Intelligent Virtual Agents, pp. 125–138, Springer,

2007.

[54] P. Turney and M. L. Littman, “Unsupervised learning of semantic orientation

from a hundred-billion-word corpus,” 2002.

[55] J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Becket,

B. Douville, S. Prevost, and M. Stone, “Animated conversation: rule-based

generation of facial expression, gesture & spoken intonation for multiple con-

versational agents,” in Proceedings of the 21st annual conference on Com-

puter graphics and interactive techniques, pp. 413–420, ACM, 1994.

[56] C. Pelachaud, “Studies on gesture expressivity for a virtual agent,” Speech

Communication, vol. 51, no. 7, pp. 630–639, 2009.

[57] J. C. Ward and A. L. Ostrom, “The internet as information minefield:

an analysis of the source and content of brand information yielded by net

searches,” Journal of Business research, vol. 56, no. 11, pp. 907–914, 2003.

[58] S. Bai, T. Zhu, and L. Cheng, “Big-five personality prediction based on user

behaviors at social network sites,” arXiv preprint arXiv:1204.4809, 2012.

[59] M. Smith, V. Barash, L. Getoor, and H. W. Lauw, “Leveraging social context

for searching social media,” in Proceedings of the 2008 ACM workshop on

Search in social media, pp. 91–94, ACM, 2008.

[60] A. Pak and P. Paroubek, “Twitter as a corpus for sentiment analysis and

opinion mining.,” in LREC, 2010.
