Crawling Big Data in a New Frontier for Socioeconomic Research: Testing with Social Tagging

JUAN DIEGO BORRERO, [email protected]
ESTRELLA GUALDA, [email protected]
University of Huelva

Seminários CIEO - Universidade do Algarve
Faro, 31 October, 2012



Table of Contents

• 1. Introduction
• 2. Theoretical perspective
  – Web 2.0 and collaborative tagging
  – Tagging and folksonomy
  – The collective knowledge inherent in social tags
  – Tagging and social networks
  – Social Web and its impact on Information Retrieval (IR) and Recommender Systems (RS)
• 3. Methodology
  – 3.1. Data collection procedure
  – 3.2. Analysis procedure: SNA
• 4. Results
  – 4.1. Centralization: Authority
  – 4.2. Node Tags: Users producing Tags
• 5. Discussion
  – 5.1. Centrality and Power
  – 5.2. Central Tags: Users producing Tags
• 6. Conclusions and future research


1. Introduction: What puzzles?

1. The era of Big Data and Social Media has begun!

E.g., Twitter, Facebook, Tumblr, Delicious, YouTube, Flickr, Wikipedia…

2. Will it transform how we study human communication and social relations?

3. Will it alter what ‘research’ means?

Some or all of the above?


1. Introduction: What puzzles?

1. Big Data is notable not because of its size, but because of its relationality to other data. Big Data is fundamentally networked. Its value comes from the patterns that can be derived by making connections between pieces of data: about an individual, about individuals in relation to others, about groups of people, or simply about the structure of information itself.

2. Big Data is important because it refers to an analytic phenomenon playing out in academia.

3. Big Data is important because of its popular salience.


1. Introduction: Tagging

• New technologies have made it possible for a wide range of people to produce, share, interact with, and organize data.

• People can classify the huge amount of information at their disposal in the form of tags.


1. Introduction: Tagging in Delicious

Tags are keywords, either freely chosen by users or suggested by Delicious, that are used to annotate various types of digital content.

Source: www.delicious.com


1. Introduction: Social Tagging Systems

Many users add metadata in the form of tags

Resulting collective tag structure

Source: http://www.idonato.com/2009/05/27/fun-with-tag-clouds/

Source: http://blog.hubspot.com/blog/tabid/6307/bid/7372/9-Reasons-Why-Your-Social-Media-Strategy-Isn-t-Working.aspx/

Source: http://bvdt.tuxic.nl/index.php/the-wisdom-of-the-crowds-in-the-audiovisual-archive-domain/


1. Introduction: Delicious

Delicious is a free social bookmarking website for storing, sharing and discovering web bookmarks

Source: www.delicious.com


1. Introduction: Our Assumption

• Big Data offers the humanistic disciplines a new way to work quantitatively, and it appears to offer another, more objective kind of method for analysis.

• In reality, however, working with Big Data is still subjective.

• For this reason, it is crucial to begin asking questions about the analytic assumptions, methodological frameworks, and underlying biases embedded in the Big Data phenomenon.


1. Introduction: Our Objectives

1. To propose a methodology for using big data from Web 2.0 in social research,

2. To apply it to extract data automatically from the Delicious social bookmarking website, and

3. To show the type of results that this kind of analysis can offer to social scientists.

4. We focus our study on the community around the globalization of agriculture, and pay special attention to social network analysis (SNA).


2. Theoretical perspective: Web 2.0… and collaborative tagging

Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as platform, and an attempt to understand the rules for success on that new platform (O’Reilly, 2007).

Collaborative (or social) tagging is the Web 2.0 activity of annotating digital resources with keywords, i.e. tags (Golder and Huberman, 2006; Trant, 2009).

Source: http://www.laurenwood.org/anyway/2007/11/web-20-buzzwords/


2. Theoretical perspective: … collaborative tagging

A collaborative tagging system is mainly composed of three interconnected components: users, tags, and resources (Smith, 2008).

Resources: webpages, photos, videos…


2. Theoretical perspective: … collaborative tagging and folksonomy

Social tagging systems aggregate the tags of all users and describe the resources in a so-called folksonomy (Vander Wal, 2004).

Problems:
• Synonyms: global warming = climate change
• Term variations: globalization = globalisation; poor = poors
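The two folksonomy problems named on this slide can be illustrated with a small normalization step. This is a minimal sketch in Python, not the authors' program: the lookup tables `SYNONYMS` and `VARIANTS` and the function `normalize_tag` are hypothetical, and they contain only the example pairs shown on the slide.

```python
# Hypothetical lookup tables: the slide gives these pairs only as examples.
SYNONYMS = {"climate change": "global warming"}
VARIANTS = {"globalisation": "globalization", "poors": "poor"}


def normalize_tag(tag: str) -> str:
    """Map a raw tag to a canonical form: lowercase, then merge variants and synonyms."""
    t = " ".join(tag.strip().lower().split())  # trim and collapse whitespace
    t = VARIANTS.get(t, t)                     # spelling and plural variants
    return SYNONYMS.get(t, t)                  # synonyms


raw_tags = ["Globalisation", "globalization", "Climate Change", "global warming", "poors"]
canonical = sorted({normalize_tag(t) for t in raw_tags})
print(canonical)
```

Five raw tags collapse into three canonical ones. In practice the tables would have to be built by hand or from a thesaurus, which is itself a subjective choice.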


2. Theoretical perspective: … folksonomy and collective knowledge

Bottom-up process…

…the tags of many different users are aggregated, and the resulting collective tag structure (such as a tag cloud) depicts the collective knowledge of Web users (Cress et al., 2012)

Source: http://blog.cimmyt.org/?p=6052


2. Theoretical perspective: Tagging and social networks

Bipartite networks are a particular class of networks whose nodes are divided into two sets, e.g. users and tags.

An opinion network (Maslov and Zhang, 2001; Blattner et al., 2007) is a network in which users connect to the objects that they gather.

The structure of social tagging websites can be viewed as a network with three different node types: the users U, the resources R (websites, i.e. URLs) and the tags T that the users U deploy to tag the websites R.

Source: Authors

Figure 1. A Bipartite Network made of three users U=(u,u’,u’’), three tags T=(t,t’,t’’) and two kinds of links: between users RU (straight lines), and between users and tags RT (dashed lines)
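A bipartite user-tag structure like the one in Figure 1 can be stored as a mapping from tags to users, and a user-user network can then be derived from it. The sketch below is in Python and rests on assumptions: the links are invented, and connecting two users when they share a tag is only one common projection rule. The slides do not state how the user-user links were actually built.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical user-tag links, using the node names of Figure 1.
user_tag_links = [
    ("u", "t"), ("u", "t'"),
    ("u'", "t'"), ("u'", "t''"),
    ("u''", "t''"),
]

# Bipartite structure: each tag maps to the set of users who applied it.
users_by_tag = defaultdict(set)
for user, tag in user_tag_links:
    users_by_tag[tag].add(user)

# One-mode projection (assumed rule): the weight of a user pair is the
# number of tags that both users applied.
shared_tags = defaultdict(int)
for tag_users in users_by_tag.values():
    for a, b in combinations(sorted(tag_users), 2):
        shared_tags[(a, b)] += 1

print(dict(shared_tags))
```

Here u and u' are linked through t', and u' and u'' through t'', while u and u'' share no tag and remain unconnected.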


2. Theoretical perspective: Social web and its impact on Information Retrieval (IR) and Recommender Systems (RS)

1. From the Social IR point of view (i.e. IR that uses folksonomies), IT creates algorithms for folksonomies in order to identify which information is relevant and to identify communities according to their needs. This paper aims to present a methodology for retrieving big data from a Web 2.0 environment.

2. We introduce social tagging as a basis for recommendations, focusing on the ternary relation between users, resources, and tags, in order to discover latent patterns linked to the activity of collaborative tagging. These patterns could be fundamental for providing effective recommendations to different actors.


3. Methodology

• Data set from: Delicious (www.delicious.com).

• Delicious = social bookmarking system with:
  – Content created, annotated and viewed by its users.
  – Non-hierarchical classification system: users can tag each of their bookmarks on the Delicious website, which provides knowledge about the bookmarked URL.
  – Collective nature:
    • Users can view bookmarks added or annotated by other users.
    • Users can organize existing tags into groups (tag bundles).


3.1. Data Collection procedure

Collected annotations made in social bookmarking services. Each annotation has at least four parts:
• 1. Link to the resource (website…)
• 2. One or more tags
• 3. User who makes the annotation
• 4. Moment/time when the annotation is made

• This article focuses on the co-occurrence of users, resources and tags, i.e. on triples (user, resource, tag).

Dataset collected: $U = \{u_1, u_2, \ldots, u_K\}$, $R = \{r_1, r_2, \ldots, r_M\}$, and $T = \{t_1, t_2, \ldots, t_N\}$
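The four-part annotation and the sets U, R and T can be made concrete with a small record type. This is a hedged sketch in Python: the `Tagging` type, the user names and the example.org URLs are hypothetical and only mirror the structure described above.

```python
from datetime import datetime
from typing import NamedTuple


class Tagging(NamedTuple):
    """One annotation: who tagged which resource, with which tags, and when."""
    user: str
    resource: str
    tags: tuple
    time: datetime


# Hypothetical annotations, invented for illustration.
annotations = [
    Tagging("user_a", "http://example.org/farm", ("food", "trade"), datetime(2011, 4, 22)),
    Tagging("user_b", "http://example.org/farm", ("food",), datetime(2011, 5, 1)),
    Tagging("user_a", "http://example.org/gmo", ("politics",), datetime(2011, 5, 21)),
]

# Flatten each annotation into (user, resource, tag) triples.
triples = [(a.user, a.resource, t) for a in annotations for t in a.tags]

# The sets U, R and T are the distinct users, resources and tags.
U = {u for u, _, _ in triples}
R = {r for _, r, _ in triples}
T = {t for _, _, t in triples}
print(len(triples), len(U), len(R), len(T))
```

Keeping the timestamp on the record, even though the analysis uses only the triples, leaves room for a later longitudinal study.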


3.1. Process to retrieve the data

(A) Start point. Identify the search attributes. An authoritative source is used as the baseline to find keywords connected to the idea of ‘globalization of agriculture’.
– Wikipedia definition of “critics of globalization” (popular, high reputation)
– Other start points (future)
– Main concepts selected manually (researcher expertise) from the website homepages, tag clouds or topics.
– Identified the 5 seed keywords (globalization + agriculture, food, organic, and GMO)
– Other concepts rejected

(B) Web crawling was carried out with a Perl program, gathering the sample of users, URLs and tags
– For globalization+agriculture; globalization+food; globalization+organic; globalization+GMO
– Between 22 April 2011 and 21 May 2011 (one complete month)
– Results: 10,220 taggings that involved 851 users on 1,077 URLs and 1,720 tags.

(C) A Haskell program reduced the amount of data by cutting the URLs and normalizing keywords, including the identification of synonyms and the elimination of variants with capital letters and of derivatives such as plurals.

(D) Dataset for analysis

Figure 2. Data Collection Procedure

Source: Authors
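Step (C) mentions "cutting the URLs" without giving details. One plausible reading, given that the results list site-level addresses such as http://www.nytimes.com/, is that each bookmarked page is reduced to its site. The sketch below shows that reading in Python rather than the authors' Haskell; the function `cut_url` and the article paths are hypothetical.

```python
from urllib.parse import urlsplit


def cut_url(url: str) -> str:
    """Reduce a bookmarked URL to its scheme and host (assumed meaning of 'cutting')."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}/"


# Hypothetical article addresses on sites that appear in the results.
bookmarks = [
    "http://www.nytimes.com/2011/04/22/some-article.html",
    "http://www.nytimes.com/2011/05/01/another-article.html?ref=global",
    "http://news.bbc.co.uk/2/hi/some-story.stm",
]
sites = sorted({cut_url(u) for u in bookmarks})
print(sites)
```

Collapsing pages into sites makes the network smaller and lets authority accumulate at the level of the publisher, at the cost of losing which article was bookmarked.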


Example: final dataset

526 URLs, 1,700 tags, 851 users

Source: Authors


Table 1. Keywords used in the topic “Globalization of agriculture”

| Search attributes used | Number of resulting tags (I+II) | Most frequent tags / main tags |
|---|---|---|
| Globalization (I) + agriculture (II) | 1,116 | Food (268), economics (176), environment (145), politics (85), trade (81), sustainability (70) |
| Globalization (I) + food (II) | 1,682 | Economy (180), economics (171), environment (122), sustainability (78), politics (60) |
| Globalization (I) + organic (II) | 22 | Business (3), fair-trade (3) |
| Globalization (I) + GMO (II) | 54 | Food (13), agriculture (12) |

Source: Authors


3.2. Analysis procedure: SNA (network analysis)

• Node centrality: identification of the nodes that are more “central” than others. Network-level property = the idea of a node’s social power, based on how well it “connects” to the network.

• Degree of a node = number of direct connections an individual has with others in the group. Highest degree = exerts influence (or authority).

In-degree = number of incoming ties, which reflects the popularity of a website. As a result, the prominent, well-connected members (those with high degree centrality) are usually the opinion leaders.

Out-degree = number of outgoing ties, which determines whether a particular user is an active or passive participant within the network.

Software: Pajek (suited to large data series). A user of the Delicious bookmarking system is simply using Delicious; the analysis looks for latent structures and the power that emerges from the network…
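The in-degree and out-degree definitions above amount to counting how often each node appears at either end of a directed user-to-URL link. The authors used Pajek; the following is only a minimal Python sketch with invented users and sites.

```python
from collections import Counter

# Hypothetical directed links: a user bookmarks (points to) a URL.
links = [
    ("/user_a", "http://site-one.example/"),
    ("/user_a", "http://site-two.example/"),
    ("/user_b", "http://site-one.example/"),
    ("/user_c", "http://site-one.example/"),
]

# Out-degree: outgoing ties per user (how active the user is).
out_degree = Counter(user for user, _ in links)
# In-degree: incoming ties per URL (how popular the website is).
in_degree = Counter(url for _, url in links)

print(in_degree.most_common(1))
print(out_degree.most_common(1))
```

Ranking both counters with `most_common` gives the kind of ordered list of sites and users that a "top authorities" table needs.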


Figure 3. Hyperlink Network Energy Kamada-Kawai Map. Bipartite Network user-URL

Source: Authors by Pajek


Results: 4.1. Centralization (Authority)

Centralization: user-URL

URL’s in-degree: sum of total inbound links
User’s out-degree: sum of total outbound links

The network is highly centralized around a few nodes:

Only 10 URLs out of 526 (1.90%) account for 32.19% of the links to URLs. These 10 URLs received 3,290 inbound links out of a total of 10,219.

Only 10 users out of 851 (1.18%) account for 14.05% of the links to URLs. These 10 users produced 1,436 outbound links out of a total of 10,219.

The 10 most central websites: nine of them were media-based (online newspapers such as The New York Times, BBC, The Guardian, Washington Post, Financial Times, Reason, The Nation, Spiegel and The Economist) (Table 2).

Identification of users with a greater degree of centrality: the user Mritiunjoy plays a very important role in the network. Mritiunjoy joined Delicious on 12 March 2007, and to date has 10,020 links and is following 38 users. Mritiunjoy Mohanty is a professor at the Indian Institute of Management Calcutta, India, and his research interests are the political economy of growth and development.


Table 2. Top Authoritative Sites in the hyperlink network

| Rank | Indegree | URL | Outdegree | User |
|---|---|---|---|---|
| 1 | 1203 | http://www.nytimes.com/ | 433 | /mritiunjoy |
| 2 | 674 | http://news.bbc.co.uk/ | 195 | /laura208 |
| 3 | 365 | http://www.guardian.co.uk/ | 127 | /rd108 |
| 4 | 186 | http://www.washingtonpost.com/ | 112 | /amaah |
| 5 | 158 | http://www.ft.com/ | 111 | /thepouncer |
| 6 | 154 | http://www.reason.com/ | 100 | /anilius |
| 7 | 147 | http://www.thenation.com/ | 100 | /emmarlyb |
| 8 | 137 | http://www.spiegel.de/ | 87 | /adorngeography |
| 9 | 136 | http://www.foodfirst.org/ | 86 | /pagolnari |
| 10 | 130 | http://www.economist.com/ | 85 | /freemanlc |

Source: Authors
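The concentration of links in the top ten nodes can be recomputed from the degrees in Table 2 and the total of 10,219 links reported in the results. A short Python sketch follows; the helper name `share` is hypothetical.

```python
# Degrees copied from Table 2; the total number of links is the one
# reported in the results (10,219).
top10_indegree = [1203, 674, 365, 186, 158, 154, 147, 137, 136, 130]
top10_outdegree = [433, 195, 127, 112, 111, 100, 100, 87, 86, 85]
TOTAL_LINKS = 10_219


def share(values, total):
    """Percentage of all links accounted for by the given nodes."""
    return 100 * sum(values) / total


print(sum(top10_indegree), round(share(top10_indegree, TOTAL_LINKS), 2))
print(sum(top10_outdegree), round(share(top10_outdegree, TOTAL_LINKS), 2))
```

The same function applied to the top 1, 5 or 50 nodes would trace how quickly links accumulate along the rank order.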


Figure 4. user-user Unipartite Network Energy Kamada-Kawai Map

Degree Cut-off = 1. Size: Degree

Source: Authors by Pajek


Figure 5. user-user Unipartite Network Energy Kamada-Kawai Map

Degree Cut-off = 30. Nodes = 211. Size: Betweenness

Source: Authors by Pajek


Source: Authors by Pajek

Figure 6. user-user Unipartite Network Energy Kamada-Kawai Map

Degree Cut-off = 30. Nodes = 211. Size: Closeness


Source: Authors by Pajek

Figure 7. user-user Unipartite Network Energy Kamada-Kawai Map

Degree Cut-off = 30. Nodes = 211. Size: Degree


Figure 8. Hyperlink Network. 851 users arranged in rank order by number of outbound links and 1,077 URLs arranged in rank order by number of inbound links

Why are a few users and websites better connected than the majority?

Source: Authors


The value of the identified nodes (websites) is due to:

• The links that they receive (their instrumental nature)

• The profile of these organizations: newspapers that channel large quantities of resources (information), i.e. the quality of the links = central URLs with authority.


Results: 4.2. Node Tags: Users producing Tags

• Collective tag structure produced with Wordle, excluding the search keywords (globalization, agriculture, food, organic, and GMO).

• The sizes of the terms in the tag clouds are proportional to their weights; the 25 highest-weighted tags are shown.

• Tag clouds: identifying the topical groupings in a tag network
  – Identification of topics around the globalization of agriculture
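The rule that term size is proportional to weight can be sketched as a linear scaling of tag counts. This Python example is an assumption about the general idea, not a description of how Wordle computes sizes: `MAX_FONT_PT` is a hypothetical setting, and the counts are a handful of values from the deck's tag frequency table.

```python
from collections import Counter

# A few tag counts as reported in the deck's frequency table.
tag_counts = Counter({
    "economics": 350, "environment": 274, "sustainability": 153,
    "politics": 152, "economy": 144,
})

TOP_N = 25         # the slides keep the 25 highest-weighted tags
MAX_FONT_PT = 70   # hypothetical font size for the heaviest tag

top = tag_counts.most_common(TOP_N)
heaviest = top[0][1]
# Linear scaling: font size proportional to the tag's weight.
font_sizes = {tag: round(MAX_FONT_PT * count / heaviest) for tag, count in top}
print(font_sizes)
```

With linear scaling, two near-equal tags such as sustainability and politics end up almost the same size, which is what makes groups of related topics visible in the cloud.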


Figure 9. Tag Cloud for the Agriculture Globalization Network Identified on the Delicious Data Set

The resulting main topics were economics and the environment. These are the main keywords that users employ in Delicious to describe or characterise the topic ‘globalization of agriculture’.

Source: Authors by Wordle


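The 50 most frequent tags (tags used more than 20 times)

| Tag | Count | Tag | Count | Tag | Count |
|---|---|---|---|---|---|
| Economics | 350 | World | 47 | BBC | 30 |
| Environment | 274 | Global | 46 | Future | 30 |
| Sustainability | 153 | Capitalism | 45 | Geography | 30 |
| Politics | 152 | Green | 43 | Water | 30 |
| Economy | 144 | Research | 42 | Nutrition | 29 |
| Trade | 131 | Crisis | 41 | Government | 27 |
| Business | 99 | International | 41 | Wto | 27 |
| Poverty | 97 | Oil | 38 | Agribusiness | 26 |
| Culture | 84 | Prices | 37 | Ecology | 25 |
| Farming | 84 | Activism | 35 | Europe | 25 |
| Africa | 83 | News | 35 | Globalwarming | 23 |
| Health | 78 | Science | 35 | Reference | 22 |
| Development | 76 | Hunger | 34 | Technology | 22 |
| Energy | 76 | Usa | 34 | Biofuel | 21 |
| India | 65 | Inflation | 32 | Corporations | 21 |
| China | 59 | History | 31 | Farmers | 21 |
| Policy | 55 | Local | 31 | | |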


Discussion: 5.1. Centrality and Power

In this Delicious network on the globalization of agriculture, The New York Times far surpasses the other URLs (1,203 inbound links, followed by the BBC website with 674).

The websites most cited, recommended or considered with regard to a topic occupy a central place and play an important role in the dissemination of news, events, trending topics, ideology, culture, etc.

Identifying key collective actors (represented here by URLs) allows a better understanding of leadership, influence processes, and power-related structures.

For social practitioners, it is a good way to identify key informants in a community through whom to disseminate useful and important information.

Very unequal distribution of power among the URLs cited by users on the topic of globalization of agriculture.

- Important accumulation of inlinks.

ADVANTAGES OF THIS TYPE OF KNOWLEDGE FOR RESEARCHING AND INTERVENING


Discussion: 5.1. Centrality and Power

• FOCUS ON users: identification of key actors that disseminate and share URLs, such as the previously cited Mritiunjoy
  – Determine where the key elements that structure the network emerge from.
• Why is ‘that’ actor so important in the network of globalization of agriculture?
  – Key actors in this type of network could configure and reconfigure the evolution of the network (TIME), and structure and even manipulate the type of interchange of resources in Delicious or in similar bookmarking sites.
• Is it by chance? Do the most prominent actors on a website like Delicious correspond to a profile of very active and participative people? Do they usually work in this area (or have it as a hobby), and is this why they accumulate and tag so many URLs in Delicious?
  – Further steps of the research.


5.2. Central Tags: Users producing Tags

• Tags suggested by the website + new tags added in a creative way
• ‘Tag cloud’: visual approach to the language used by users
• Out of a total of 1,700 tags, two words were the main ones.
• Each user could label a URL with an unlimited number of tags (average 12 tags per user, max 433 and min 2).
• The most frequently used tags were the words ‘economics’ (350 citations out of 1,700 tags, 20.6%) and ‘environment’ (273, 16%).
• Other very frequent tags were sustainability (153), politics (152), economy (144), trade (131), business (99), poverty (97), culture (84), farming (84), africa (83), health (78), and development (76). In relative terms, these 13 tags represent one out of four labelled tags around the topic (25.9%).

Questions:
• What are the reasons for the prominence of the first two tags around the globalization of agriculture?
• Are some of the 1,700 tags found used interchangeably?
  – Why is the word ‘economics’ used sometimes, and ‘economy’ at other times?
  – Are they used in the same way when classifying the URLs?
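One way to approach the question of interchangeable tags is to compare the sets of URLs that two tags are attached to. The Python sketch below uses the Jaccard index, which is a technique introduced here for illustration and not one the slides name; the (url, tag) pairs are invented.

```python
from collections import defaultdict

# Hypothetical (url, tag) pairs, invented for illustration.
url_tags = [
    ("url1", "economics"), ("url1", "economy"),
    ("url2", "economics"),
    ("url3", "economy"), ("url3", "economics"),
    ("url4", "economy"),
]

# Group the URLs under each tag.
urls_by_tag = defaultdict(set)
for url, tag in url_tags:
    urls_by_tag[tag].add(url)


def jaccard(a: set, b: set) -> float:
    """Share of URLs carrying both tags among URLs carrying either tag."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


overlap = jaccard(urls_by_tag["economics"], urls_by_tag["economy"])
print(overlap)
```

A value near 1 would suggest that the two words label the same URLs and are used interchangeably; a value near 0 would suggest they mark different content.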


Conclusions: achieved goals

• Presenting this methodology for using big data from Web 2.0 in socioeconomic research, illustrated with a social bookmarking site (Delicious), is:

• A first step towards the development of empirical techniques capable of automatically differentiating groups of individuals with common interests, and individuals who occupy a more central position.

• A first stone in the difficult process of understanding and discovering patterns in the processes that characterize users tagging URLs for collaborative reasons.

• Utility: discovering latent patterns = providing effective recommendations to different actors.

• Understanding a community of more than a thousand links.

• Retrieval and analysis of information: complex, but easier when working in interdisciplinary teams


Other topics for Researching: Future

• Improvements are necessary in retrieval methods and in the implementation of Information Retrieval and Recommender Systems techniques
• Influence of the first tags on the following ones. Role of innovation and creativity in tagging
• Evolution and usage of language around an issue over time.
• Ideological and terminological approaches in the national/international arena
• Use of some tags in classifying URLs and the distinction among users in the way they use some words/tags
  – Distinction between scientists and other professionals or users?
  – Identify users with the same tagging patterns, or URLs that were similarly labelled: study structural equivalences
• Other possible studies based on retrieving the pages and carrying out content analysis
• Why are some labels present/absent?
• Are there “traditions”/“fashions” in tagging in the Web 2.0?
• Comparing results from Delicious and from other social bookmarking sites
• Go in-depth about users (if possible)
• And other explorations, other starting points, other bookmarking sites, other indicators, complementary to those used in this illustration


Possible Applications

• Producing and manipulating public opinion (when recommending and describing websites) and markets
  – If we know the interests of users belonging to a network, we could also make recommendations
• Recommender Systems: the change to a ternary relation between users, resources, and tags is more complex to manage.
• Important for researchers interested in formulating strategies for intervention and mobilisation, but practitioners and companies could also make use of this.
• Discovering the central elements in a network (users and URLs), together with the tags used by users, could be key to designing future strategies for disseminating messages and achieving more successful communication, for instance by using important keywords to attract more attention.
• Implementation of Information Retrieval and Recommender Systems techniques in social commerce and social media contexts.
• Applications in advertising, mobilising, etc.
• Security, social studies, market studies, consumers
• Time: longitudinal analysis
• Etcetera