mike thelwall: introduction to webometrics

44
Introduction to Webometrics Mike Thelwall @mikethelwall Professor of Information Science, Statistical Cybermetrics Research Gp, University of

Upload: library-and-information-science-research-coalition

Post on 07-May-2015

2.606 views

Category:

Technology


8 download

DESCRIPTION

Presentation to the second LIS DREaM workshop held at the British Library on Monday 30th January 2012.More information available at: http://lisresearch.org/dream-project/dream-event-3-workshop-monday-30-january-2012/

TRANSCRIPT

Page 1: Mike Thelwall: Introduction to Webometrics

Introduction to Webometrics

Mike Thelwall@mikethelwall

Professor of Information Science,Statistical Cybermetrics Research Gp,

University of Wolverhampton

Page 2: Mike Thelwall: Introduction to Webometrics

Reminder of pre-workshop task

Delegates were asked to join YouTube and leave comments and replies to earlier comments on the video:

Department of Library and Information Science, Delhihttp://www.youtube.com/watch?v=_-OAsF9uRfc

These contributions will form part of the discussion at the end of the session, and include reference to the self-declared age and gender information from YouTube.

Reminder of pre-workshop task

Delegates were asked to join YouTube and leave comments and replies to earlier comments on the video:

Department of Library and Information Science, Delhihttp://www.youtube.com/watch?v=_-OAsF9uRfc

These contributions will form part of the discussion at the end of the session, and include reference to the self-declared age and gender information from YouTube.

Page 3: Mike Thelwall: Introduction to Webometrics

Overview of “Webometrics”

• What is Webometrics? – Gathering, processing and analysing large scale data from the

web (web pages, hyperlinks, blogs, Web 2.0) for many purposes that include online communication

• What can Webometrics offer other researchers?– Software to gather data from web sites, search engines, social

network sites and blogs; methods to extract useful patterns • Common data sources

– Webometric Analyst software: Twitter, YouTube, the Web, Technorati, Bing

– Bespoke: Any resource with an API, page scraping of other sites, SocSciBot web crawler

http://lexiurl.wlv.ac.uk

Page 4: Mike Thelwall: Introduction to Webometrics

1. Background: Webometrics

• Webometrics is about gathering data on the Web, and measuring aspects of the Web:– web sites– web pages– hyperlinks– YouTube video commenter networks– web search engine results– MySpace Friend networks– Twitter or blog trends

• …for varied social science purposes

Page 5: Mike Thelwall: Introduction to Webometrics

New problems: Web-based phenomena

• Webometrics can analyse online academic communication– Why do academic web sites interlink?– Which academic web sites interlink?– What academic interlinking patterns exist?– Which web sites/groups/documents have the

most online impact, and why?

Page 6: Mike Thelwall: Introduction to Webometrics

Old problems: Offline phenomena reflected online

• Some offline phenomena have measurable online reflections– International communication– Inter-university collaboration– University-business collaboration– The impact or spread of ideas– Public opinion about science

Page 7: Mike Thelwall: Introduction to Webometrics

Example: The online impact of research groups (NetReAct)

Page 8: Mike Thelwall: Introduction to Webometrics

Normalised linking, smallest countries removed

Geopoliticalconnected

SwedenFinland

Norway

UK

Germany

Austria Switzerland

Poland

Italy

Belgium

Spain

France

NL

Example:Links betweenEU universities

Page 9: Mike Thelwall: Introduction to Webometrics

International biofuels research network

Page 10: Mike Thelwall: Introduction to Webometrics

Data Gathering/Processing Tools

• Webometric Analyst – web citations, web text, YouTube, Flickr, Technorati– Submits thousands of queries to Bing and

summarises the results in standard ways

• SocSciBot – links, web text– Web Crawler &

analyser

Page 11: Mike Thelwall: Introduction to Webometrics

2. altmetrics in traditional research evaluation

• Altmetrics can supplement traditional citation impact with non-traditional online impact – E.g., educational, discussion-based

• Often weaker than citation data but useful for research groups that have non-standard types of impacts

Page 12: Mike Thelwall: Introduction to Webometrics

The Integrated Online Impact Indicator (IOI)

• Combines a range of online sources into one indicator– Google Scholar +– Google Books +– Course reading lists +– Google Blogs +– PowerPoint presentations = IOI

• OR select individual separate components

Invented by Kayvan Kousha

Page 13: Mike Thelwall: Introduction to Webometrics

New source 1: Google Scholar

• Wider evidence of academic impact• Wider types of academic publications, some

non-academic publications• Not reliable• Coverage variable• Can’t be automatically queried• Free

Page 14: Mike Thelwall: Introduction to Webometrics

New source 2: Google Books

• Books typically not indexed in WoS or Scopus• Relevant in book-based disciplines (arts,

humanities, some social sciences)• Reliability unknown but probably not good• Coverage variable• Can be automatically queried• Free [Clifford Lynch]

Page 15: Mike Thelwall: Introduction to Webometrics

New source 3: Course reading lists

• Evidence of educational impact• Can automatically construct queries to detect

individual articles in online syllabuses• Get results via advanced Google/Yahoo/Live

Search queries• Works for most articles

– Fails for short common article titles

Page 16: Mike Thelwall: Introduction to Webometrics

New source 4: Blogs

• Evidence of impact on discussions• Educational impact, public dissemination

evidence, academic impact in discursive subjects?

• Not possible to automate in the largest database (Google Blogs)?

• Not a well researched area

Page 17: Mike Thelwall: Introduction to Webometrics

New source 5: PowerPoint Presentations

• Evidence of educational/scholarly impact• Especially relevant for discursive subjects?• Automated Live Search/Yahoo advanced

queriesIOI = a*Scholar + b*PowerPoint + c*Blogs + d*

Syllabus + e* Books

Or use qualitative analyses of the different sources

Page 18: Mike Thelwall: Introduction to Webometrics

2. Sentiment Strength Detection in the Social Web with SentiStrength• Detect positive and negative sentiment strength

in short informal text– Develop workarounds for lack of standard grammar

and spelling– Harness emotion expression forms unique to MySpace

or CMC (e.g., :-) or haaappppyyy!!!)– Classify simultaneously as positive 1-5 AND negative 1-

5 sentiment

Thelwall, M., Buckley, K., & Paltoglou, G. (in press). Sentiment strength detection for the social Web.Journal of the American Society for Information Science and Technology.

Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text.Journal of the American Society for Information Science and Technology, 61(12), 2544-2558.

Page 19: Mike Thelwall: Introduction to Webometrics

SentiStrength Algorithm - Core

• List of 2,489 positive and negative sentiment term stems and strengths (1 to 5), e.g.– ache = -2, dislike = -3, hate=-4, excruciating -5– encourage = 2, coolest = 3, lover = 4

• Sentiment strength is highest in sentence; or highest sentence if multiple sentences

Page 20: Mike Thelwall: Introduction to Webometrics

• My legs ache.

• You are the coolest.

• I hate Paul but encourage him.

-2

3

-4 2

1, -2

positive, negative

3, -1

2, -4

Page 21: Mike Thelwall: Introduction to Webometrics

Extra sentiment methods

• spelling correction nicce -> nice• booster words alter strength very happy• negating words flip emotions not nice• repeated letters boost sentiment/+ve niiiice • emoticon list :) =+2• exclamation marks count as +2 unless –ve hi!• repeated punctuation boosts sentiment good!!!• negative emotion ignored in questions u h8 me?• Sentiment idiom list shock horror = -2

Online as http://sentistrength.wlv.ac.uk/

Page 22: Mike Thelwall: Introduction to Webometrics

Tests against human coders

Data set

Positivescores -correlation with humans

Negativescores -correlation with humans

YouTube 0.589 0.521

MySpace 0.647 0.599

Twitter 0.541 0.499

Sports forum 0.567 0.541

Digg.com news 0.352 0.552

BBC forums 0.296 0.591

All 6 data sets 0.556 0.565

SentiStrengthagrees withhumansas much as theyagree with eachother 1 is perfect agreement, 0 is random agreement

Page 23: Mike Thelwall: Introduction to Webometrics

Why the bad results for BBC? (and Digg)

• Irony, sarcasm and expressive language e.g.,– David Cameron must be very happy that I have

lost my job.– It is really interesting that David Cameron and

most of his ministers are millionaires.– Your argument is a joke.

$

Page 24: Mike Thelwall: Introduction to Webometrics

2. Twitter – sentiment in major media events

• Analysis of a corpus of 1 month of English Twitter posts (35 Million, from 2.7M accounts)

• Automatic detection of spikes (events)• Assessment of whether sentiment changes during

major media events

Page 25: Mike Thelwall: Introduction to Webometrics

Automatically-identified Twitter spikes

9 Mar 2010

9 Feb 2010

Proportion of tweetsmentioning keyword

Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406-418.

Page 26: Mike Thelwall: Introduction to Webometrics

Chile

matc

hin

g p

ost

sSenti

ment

stre

ngth

Sub

j.

Increase in –ve sentiment strength

9 Feb 2010

9 Feb 2010

Date and time

Date and time

9 Mar 2010

9 Mar 2010

Av. +ve sentimentJust subj.Av. -ve sentimentJust subj.

Proportion of tweetsmentioning Chile

Page 27: Mike Thelwall: Introduction to Webometrics

#oscars

% m

atc

hin

g p

ost

sSenti

ment

stre

ngth

Sub

j.

Increase in –ve sentiment strength

Date and time

Date and time

9 Feb 2010

9 Feb 2010

9 Mar 2010

9 Mar 2010

Av. +ve sentimentJust subj.Av. -ve sentimentJust subj.

Proportion of tweetsmentioning the Oscars

Page 28: Mike Thelwall: Introduction to Webometrics

Sentiment and spikes

• Statistical analysis of top 30 events:– Strong evidence that higher volume hours have

stronger negative sentiment than lower volume hours

– No evidence that higher volume hours have different positive sentiment strength than lower volume hours

=> Spikes are typified by small increases in negativity

Page 29: Mike Thelwall: Introduction to Webometrics

3. YouTube Video comments

• 1000 comm. per video via Webometric Analyst (or the YouTube API)

• Good source of social web text data• Analysis of all comments on a pseudo-random

sample of 35,347 videos with < 1000 comments

Page 30: Mike Thelwall: Introduction to Webometrics
Page 31: Mike Thelwall: Introduction to Webometrics

Using Webometric Analyst

• Download free from http://lexiurl.wlv.ac.uk• Start, select classic interface, YouTube Tab

Page 32: Mike Thelwall: Introduction to Webometrics

Reply networks

• Illustrate the replies to a YouTube video in network form

• Reveal age and gender of posters• Reveal patterns of discussion in the replies (if

any)• Take up to 25 minutes to make per video with

Webometric Analyst

Page 33: Mike Thelwall: Introduction to Webometrics

Reply networkExtended coreinteractions2x2=5 video

Nodes (people)blue = malepink = femaleArrows (replies)red = happy repliesblack = angry replies

Page 34: Mike Thelwall: Introduction to Webometrics

The 10 most ridiculous Black Metal videos-A very sparse reply network-Nodes are mostly connected in 2s and 3s

Page 35: Mike Thelwall: Introduction to Webometrics

Black Metal vs. Deathcore-Denser reply network-On-going debates and

contentious issues

Page 36: Mike Thelwall: Introduction to Webometrics

Other networks

• Connections can be• Friendship in YouTube (must be reciprocal)• Subscription in YouTube (non-reciprocal, based upon

interest in video content)• Friends in common in YouTube

– suggests factors in common (e.g., bands) rather than people in common

• Subscriptions in common in YouTube – again suggests factors in common (e.g., bands) rather than

people in common

Page 37: Mike Thelwall: Introduction to Webometrics

Very sparseFriendnetwork= common

Page 38: Mike Thelwall: Introduction to Webometrics

Common Friendsnetwork – with adensely connectedcore

Page 39: Mike Thelwall: Introduction to Webometrics

Large-scale analysis of YouTube• Purpose: to discover patterns, norms and unusual

behaviour in YouTube• Method:1. Generate a large sample of YouTube videos

a. Running searches for many terms from a large word list

b. Selecting a video at random from each set of results2. Extract properties of the videos and commenters3. Calculate averages and distributions4. Examine extreme videos

Page 40: Mike Thelwall: Introduction to Webometrics

Comments are mainly positive

Page 41: Mike Thelwall: Introduction to Webometrics
Page 42: Mike Thelwall: Introduction to Webometrics

Controversial and non-controversial?

Page 43: Mike Thelwall: Introduction to Webometrics

Conclusions• Sentiment analysis allows large-scale social web

analysis• YouTube videos can be analysed easily with a network

perspective on their comments

• http://lexiurl.wlv.ac.uk/ - Webometric Analyst

• Can make a reply network of the discussions of video_-OAsF9uRfc by following the instructions at http://lexiurl.wlv.ac.uk/searcher/youtubereplies.html

Page 44: Mike Thelwall: Introduction to Webometrics