mike thelwall: introduction to webometrics
DESCRIPTION
Presentation to the second LIS DREaM workshop held at the British Library on Monday 30th January 2012. More information available at: http://lisresearch.org/dream-project/dream-event-3-workshop-monday-30-january-2012/
TRANSCRIPT
Introduction to Webometrics
Mike Thelwall, @mikethelwall
Professor of Information Science, Statistical Cybermetrics Research Group,
University of Wolverhampton
Reminder of pre-workshop task
Delegates were asked to join YouTube and leave comments and replies to earlier comments on the video:
Department of Library and Information Science, Delhi: http://www.youtube.com/watch?v=_-OAsF9uRfc
These contributions will form part of the discussion at the end of the session, and include reference to the self-declared age and gender information from YouTube.
Overview of “Webometrics”
• What is Webometrics?
– Gathering, processing and analysing large-scale data from the web (web pages, hyperlinks, blogs, Web 2.0) for many purposes that include online communication
• What can Webometrics offer other researchers?
– Software to gather data from web sites, search engines, social network sites and blogs; methods to extract useful patterns
• Common data sources
– Webometric Analyst software: Twitter, YouTube, the Web, Technorati, Bing
– Bespoke: any resource with an API, page scraping of other sites, SocSciBot web crawler
http://lexiurl.wlv.ac.uk
1. Background: Webometrics
• Webometrics is about gathering data on the Web, and measuring aspects of the Web:
– web sites
– web pages
– hyperlinks
– YouTube video commenter networks
– web search engine results
– MySpace Friend networks
– Twitter or blog trends
• …for varied social science purposes
New problems: Web-based phenomena
• Webometrics can analyse online academic communication
– Why do academic web sites interlink?
– Which academic web sites interlink?
– What academic interlinking patterns exist?
– Which web sites/groups/documents have the most online impact, and why?
Old problems: Offline phenomena reflected online
• Some offline phenomena have measurable online reflections
– International communication
– Inter-university collaboration
– University-business collaboration
– The impact or spread of ideas
– Public opinion about science
Example: The online impact of research groups (NetReAct)
[Figure: Example: Links between EU universities. Normalised linking, smallest countries removed. Annotation: "Geopolitical connected". Countries shown: Sweden, Finland, Norway, UK, Germany, Austria, Switzerland, Poland, Italy, Belgium, Spain, France, NL.]
[Figure: International biofuels research network.]
Data Gathering/Processing Tools
• Webometric Analyst – web citations, web text, YouTube, Flickr, Technorati
– Submits thousands of queries to Bing and summarises the results in standard ways
• SocSciBot – links, web text
– Web crawler and analyser
2. Altmetrics in traditional research evaluation
• Altmetrics can supplement traditional citation impact with non-traditional online impact – E.g., educational, discussion-based
• Often weaker than citation data but useful for research groups that have non-standard types of impacts
The Integrated Online Impact Indicator (IOI)
• Combines a range of online sources into one indicator
– Google Scholar +
– Google Books +
– Course reading lists +
– Google Blogs +
– PowerPoint presentations = IOI
• OR select individual separate components
Invented by Kayvan Kousha
New source 1: Google Scholar
• Wider evidence of academic impact
• Wider types of academic publications, some non-academic publications
• Not reliable
• Coverage variable
• Can’t be automatically queried
• Free
New source 2: Google Books
• Books typically not indexed in WoS or Scopus
• Relevant in book-based disciplines (arts, humanities, some social sciences)
• Reliability unknown but probably not good
• Coverage variable
• Can be automatically queried
• Free [Clifford Lynch]
New source 3: Course reading lists
• Evidence of educational impact
• Can automatically construct queries to detect individual articles in online syllabuses
• Get results via advanced Google/Yahoo/Live Search queries
• Works for most articles
– Fails for short common article titles
New source 4: Blogs
• Evidence of impact on discussions
• Educational impact, public dissemination evidence, academic impact in discursive subjects?
• Not possible to automate in the largest database (Google Blogs)?
• Not a well researched area
New source 5: PowerPoint Presentations
• Evidence of educational/scholarly impact
• Especially relevant for discursive subjects?
• Automated Live Search/Yahoo advanced queries
IOI = a*Scholar + b*PowerPoint + c*Blogs + d*Syllabus + e*Books
Or use qualitative analyses of the different sources
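The IOI formula above is a weighted sum, which can be sketched in a few lines. This is only an illustration: the slides do not give values for the weights a to e, so the defaults of 1.0 are an assumption, and the `SourceCounts` and `ioi` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SourceCounts:
    """Number of online mentions found for one article or group, per source."""
    scholar: float
    powerpoint: float
    blogs: float
    syllabus: float
    books: float


def ioi(counts, a=1.0, b=1.0, c=1.0, d=1.0, e=1.0):
    """IOI = a*Scholar + b*PowerPoint + c*Blogs + d*Syllabus + e*Books.

    The weights are placeholders (assumed, not from the slides).
    """
    return (a * counts.scholar
            + b * counts.powerpoint
            + c * counts.blogs
            + d * counts.syllabus
            + e * counts.books)


if __name__ == "__main__":
    example = SourceCounts(scholar=10, powerpoint=2, blogs=3, syllabus=1, books=4)
    print(ioi(example))
```

Keeping the components in a separate record also supports the alternative on the slide: reporting each source individually instead of combining them.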
2. Sentiment Strength Detection in the Social Web with SentiStrength
• Detect positive and negative sentiment strength in short informal text
– Develop workarounds for lack of standard grammar and spelling
– Harness emotion expression forms unique to MySpace or CMC (e.g., :-) or haaappppyyy!!!)
– Classify simultaneously as positive 1-5 AND negative 1-5 sentiment
Thelwall, M., Buckley, K., & Paltoglou, G. (in press). Sentiment strength detection for the social Web. Journal of the American Society for Information Science and Technology.
Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12), 2544-2558.
SentiStrength Algorithm - Core
• List of 2,489 positive and negative sentiment term stems and strengths (1 to 5), e.g.
– ache = -2, dislike = -3, hate = -4, excruciating = -5
– encourage = 2, coolest = 3, lover = 4
• The sentiment strength of a sentence is that of its strongest term; for a text with multiple sentences, that of its strongest sentence
• Examples (term strengths, then the overall positive, negative scores):
– My legs ache. (ache = -2): 1, -2
– You are the coolest. (coolest = 3): 3, -1
– I hate Paul but encourage him. (hate = -4, encourage = 2): 2, -4
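The core rule can be sketched as follows. This is a minimal illustration, not the SentiStrength implementation: the lexicon holds only the seven terms listed on the slide, whole words are matched rather than stems, and the neutral baseline of 1 and -1 is inferred from the example scores.

```python
import re

# Only the terms shown on the slide; the real list has 2,489 stems.
LEXICON = {
    "ache": -2, "dislike": -3, "hate": -4, "excruciating": -5,
    "encourage": 2, "coolest": 3, "lover": 4,
}


def score_sentence(sentence):
    """Return (positive, negative): the strongest term on each scale."""
    positive, negative = 1, -1  # neutral baseline on both scales
    for word in re.findall(r"[a-z']+", sentence.lower()):
        strength = LEXICON.get(word, 0)
        if strength > 0:
            positive = max(positive, strength)
        elif strength < 0:
            negative = min(negative, strength)
    return positive, negative


def score_text(text):
    """Score a text as its strongest sentence on each scale."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    scores = [score_sentence(s) for s in sentences] or [(1, -1)]
    return max(p for p, _ in scores), min(n for _, n in scores)


if __name__ == "__main__":
    for text in ("My legs ache.", "You are the coolest.",
                 "I hate Paul but encourage him."):
        print(text, score_text(text))
```

On the three slide examples this sketch returns the same pairs as the slide: (1, -2), (3, -1) and (2, -4).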
Extra sentiment methods
• spelling correction: nicce -> nice
• booster words alter strength: very happy
• negating words flip emotions: not nice
• repeated letters boost sentiment/+ve: niiiice
• emoticon list: :) = +2
• exclamation marks count as +2 unless –ve: hi!
• repeated punctuation boosts sentiment: good!!!
• negative emotion ignored in questions: u h8 me?
• sentiment idiom list: shock horror = -2
Online as http://sentistrength.wlv.ac.uk/
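Three of the extra rules (booster words, negating words and the emoticon list) can be sketched as adjustments to term strengths. Everything numeric here is an assumption for illustration except :) = +2: the strengths for "happy" and "nice", the booster size of 1, and the choice to flip a negated term to the opposite sign are not given on the slides.

```python
# Hypothetical values, except the emoticon strength, which is from the slide.
LEXICON = {"happy": 2, "nice": 2}
BOOSTERS = {"very": 1}
NEGATORS = {"not"}
EMOTICONS = {":)": 2}


def term_strengths(text):
    """Return the adjusted signed strength of each sentiment term in text."""
    strengths = []
    boost, negate = 0, False
    for token in text.lower().split():
        if token in EMOTICONS:
            strengths.append(EMOTICONS[token])
            continue
        word = token.strip(".,!?")
        if word in BOOSTERS:
            boost += BOOSTERS[word]      # applies to the next sentiment term
        elif word in NEGATORS:
            negate = True                # flips the next sentiment term
        elif word in LEXICON:
            strength = LEXICON[word]
            strength += boost if strength > 0 else -boost
            if negate:
                strength = -strength
            strengths.append(max(-5, min(5, strength)))  # keep within 1 to 5
            boost, negate = 0, False
        else:
            boost, negate = 0, False     # modifiers only affect adjacent terms
    return strengths


if __name__ == "__main__":
    for text in ("very happy", "not nice", "nice :)"):
        print(text, term_strengths(text))
```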
Tests against human coders
Correlation of SentiStrength scores with human coders:

Data set          Positive scores   Negative scores
YouTube           0.589             0.521
MySpace           0.647             0.599
Twitter           0.541             0.499
Sports forum      0.567             0.541
Digg.com news     0.352             0.552
BBC forums        0.296             0.591
All 6 data sets   0.556             0.565

SentiStrength agrees with humans as much as they agree with each other. 1 is perfect agreement, 0 is random agreement.
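A correlation of this kind can be computed from paired scores for the same texts. The sketch below uses the Pearson coefficient, which is an assumption: the slide does not say which correlation was used, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up positive-strength scores (1 to 5) for five comments.
    tool_scores = [1, 3, 2, 4, 1]
    human_scores = [1, 2, 2, 5, 1]
    print(round(pearson(tool_scores, human_scores), 3))
```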
Why the bad results for BBC? (and Digg)
• Irony, sarcasm and expressive language, e.g.,
– David Cameron must be very happy that I have lost my job.
– It is really interesting that David Cameron and most of his ministers are millionaires.
– Your argument is a joke.
2. Twitter – sentiment in major media events
• Analysis of a corpus of 1 month of English Twitter posts (35 Million, from 2.7M accounts)
• Automatic detection of spikes (events)
• Assessment of whether sentiment changes during major media events
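The slides do not describe how spikes are detected. As one possible illustration, the sketch below flags an hour as a spike when the proportion of tweets mentioning a keyword is several times its recent average. The window length, the threshold factor and the function name are all assumptions, not the method used in the study.

```python
def find_spikes(proportions, window=24, factor=5.0):
    """Return indexes of hours whose keyword proportion is at least
    `factor` times the mean of the preceding `window` hours.

    proportions: hourly fraction of tweets mentioning the keyword.
    """
    spikes = []
    for i in range(window, len(proportions)):
        baseline = sum(proportions[i - window:i]) / window
        if baseline > 0 and proportions[i] >= factor * baseline:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Flat background of 1% with one hour at 20%.
    series = [0.01] * 30 + [0.20] + [0.01] * 5
    print(find_spikes(series))
```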
Automatically-identified Twitter spikes
[Figure: proportion of tweets mentioning a keyword over time, 9 Feb 2010 to 9 Mar 2010.]
Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406-418.
[Figure: Chile. Upper panel: proportion of tweets mentioning Chile (% matching posts). Lower panel: sentiment strength (average +ve and average -ve sentiment, each for all tweets and for just subjective tweets), annotated "Increase in –ve sentiment strength". x-axis: date and time, 9 Feb 2010 to 9 Mar 2010.]
[Figure: #oscars. Upper panel: proportion of tweets mentioning the Oscars (% matching posts). Lower panel: sentiment strength (average +ve and average -ve sentiment, each for all tweets and for just subjective tweets), annotated "Increase in –ve sentiment strength". x-axis: date and time, 9 Feb 2010 to 9 Mar 2010.]
Sentiment and spikes
• Statistical analysis of top 30 events:
– Strong evidence that higher volume hours have stronger negative sentiment than lower volume hours
– No evidence that higher volume hours have different positive sentiment strength than lower volume hours
=> Spikes are typified by small increases in negativity
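The comparison behind this result can be illustrated descriptively: split an event's hours at the median tweet volume and compare average negative sentiment strength in the two halves. This is only a sketch with made-up numbers; it is not the statistical test used in the study, and the median split and function name are assumptions.

```python
from statistics import mean, median


def compare_by_volume(hours):
    """hours: list of (tweet_count, avg_negative_strength) per hour,
    with strength on the 1 to 5 scale.

    Returns (mean strength in higher-volume hours,
             mean strength in lower-volume hours).
    """
    cut = median(count for count, _ in hours)
    high = [s for count, s in hours if count > cut]
    low = [s for count, s in hours if count <= cut]
    if not high or not low:
        raise ValueError("cannot split hours into two volume groups")
    return mean(high), mean(low)


if __name__ == "__main__":
    # Made-up hourly data for one event.
    data = [(100, 1.2), (120, 1.3), (900, 1.6), (1000, 1.8)]
    high, low = compare_by_volume(data)
    print(f"higher volume: {high:.2f}, lower volume: {low:.2f}")
```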
3. YouTube Video comments
• Up to 1000 comments per video via Webometric Analyst (or the YouTube API)
• Good source of social web text data
• Analysis of all comments on a pseudo-random sample of 35,347 videos with < 1000 comments
Using Webometric Analyst
• Download free from http://lexiurl.wlv.ac.uk
• Start, select classic interface, YouTube tab
Reply networks
• Illustrate the replies to a YouTube video in network form
• Reveal age and gender of posters
• Reveal patterns of discussion in the replies (if any)
• Take up to 25 minutes to make per video with Webometric Analyst
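A reply network is a directed graph in which each node is a commenter and each arrow points from a replier to the person replied to. The sketch below builds one from a list of comment records. The record layout (`id`, `author`, `reply_to`) is a hypothetical format chosen for illustration, not the output format of Webometric Analyst or the YouTube API.

```python
from collections import defaultdict


def build_reply_network(comments):
    """Return {(replier, replied_to): number_of_replies}.

    comments: list of dicts with keys 'id', 'author' and 'reply_to'
    (the id of the comment being answered, or None).
    """
    author_of = {c["id"]: c["author"] for c in comments}
    edges = defaultdict(int)
    for c in comments:
        parent = c.get("reply_to")
        # Skip top-level comments, unknown parents and self-replies.
        if parent in author_of and author_of[parent] != c["author"]:
            edges[(c["author"], author_of[parent])] += 1
    return dict(edges)


if __name__ == "__main__":
    sample = [
        {"id": 1, "author": "ann", "reply_to": None},
        {"id": 2, "author": "bob", "reply_to": 1},
        {"id": 3, "author": "ann", "reply_to": 2},
        {"id": 4, "author": "bob", "reply_to": 3},
    ]
    print(build_reply_network(sample))
```

Attributes such as self-declared age and gender could be attached to the nodes, and sentiment scores to the arrows, in the same way.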
[Figure: Reply network, extended core interactions, 2x2=5 video. Nodes (people): blue = male, pink = female. Arrows (replies): red = happy replies, black = angry replies.]
[Figure: The 10 most ridiculous Black Metal videos. A very sparse reply network; nodes are mostly connected in 2s and 3s.]
[Figure: Black Metal vs. Deathcore. Denser reply network; on-going debates and contentious issues.]
Other networks
• Connections can be
– Friendship in YouTube (must be reciprocal)
– Subscription in YouTube (non-reciprocal, based upon interest in video content)
– Friends in common in YouTube: suggests factors in common (e.g., bands) rather than people in common
– Subscriptions in common in YouTube: again suggests factors in common (e.g., bands) rather than people in common
[Figure: Very sparse Friend network = common.]
[Figure: Common Friends network, with a densely connected core.]
Large-scale analysis of YouTube
• Purpose: to discover patterns, norms and unusual behaviour in YouTube
• Method:
1. Generate a large sample of YouTube videos
a. Running searches for many terms from a large word list
b. Selecting a video at random from each set of results
2. Extract properties of the videos and commenters
3. Calculate averages and distributions
4. Examine extreme videos
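Step 1 of the method can be sketched as follows. The `search` argument is a hypothetical stand-in for whatever returns video IDs for a term (for example a YouTube search); it is passed in as a function so the sketch does not assume any particular API. Dropping duplicate videos is an added assumption.

```python
import random


def sample_videos(terms, search, rng=None):
    """Pick one video at random from the search results for each term.

    terms:  iterable of search terms from a word list.
    search: function mapping a term to a list of video IDs (hypothetical).
    rng:    optional random.Random, for reproducible samples.
    """
    rng = rng or random.Random()
    sample, seen = [], set()
    for term in terms:
        results = search(term)
        if not results:
            continue  # no matches for this term
        video = rng.choice(results)
        if video not in seen:  # assumption: keep each video only once
            seen.add(video)
            sample.append(video)
    return sample


if __name__ == "__main__":
    fake_index = {"cat": ["v1", "v2"], "dog": ["v3"], "zzz": []}
    print(sample_videos(["cat", "dog", "zzz"], lambda t: fake_index[t],
                        random.Random(0)))
```

Steps 2 to 4 then operate on the sampled videos: extract their properties, summarise them, and inspect the outliers.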
Comments are mainly positive
Controversial and non-controversial?
Conclusions
• Sentiment analysis allows large-scale social web analysis
• YouTube videos can be analysed easily with a network perspective on their comments
• http://lexiurl.wlv.ac.uk/ - Webometric Analyst
• Can make a reply network of the discussions of video _-OAsF9uRfc by following the instructions at http://lexiurl.wlv.ac.uk/searcher/youtubereplies.html