Dissecting Wikipedia

Andrew Gray, Wikipedian in Residence, British Library
[email protected] // @generalising


DESCRIPTION

Talk on "Dissecting Wikipedia" given at CRASSH, Cambridge, on 6th March 2013. Abstract: Andrew Gray, the British Library's Wikipedian in Residence, has been working on an AHRC-supported program to help more academics and researchers engage with Wikipedia. In this talk, he will give a brief history of the Wikipedia project, looking at its origins and the way it has developed over time. The talk will also cover the growing body of research around Wikipedia itself: well over 2,000 peer-reviewed papers have been published that examine Wikipedia in some way, studying the project's content and community, or using this data to explore broader questions of collaboration and interaction.

TRANSCRIPT

Page 1: Dissecting Wikipedia

Andrew Gray
Wikipedian in Residence, British Library
[email protected] // @generalising

Page 2: Wikipedia & Wikimedia

Wikimedia
- Movement and charitable body
- 80-100,000 contributors in 280 languages and eleven core projects
- Image repository, dictionary, news site…
- …used by almost 500,000,000 people

Wikipedia
- 25,000,000 articles, 4,000,000 in English, representing 8-9,000,000 topics & entities
- 6,500 articles and 235,000 edits per day

(…and twelve years ago, this was all fields…)

Page 3: …so what is Wikipedia?

…an encyclopedia (more or less)

…written neutrally

…and verifiably

…using previously published information

…free to use, distribute, or reuse

…a collaborative community

…with no firm rules

Page 4: A developing internal infrastructure

- All edits are visible through watchlists and page histories (see the sketch after this list)
- About 7% of edits are vandalism or malicious; there are processes to detect these
- Median time to correction is under 2 minutes… but some errors stay much longer
- Individual discussion pages for all articles – “talk”
- Quality review and assessment process
- Specialised working groups and central noticeboards, e.g. content topics, style, dispute resolution, copyright, etc.
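
To make the “page histories” point concrete, here is a minimal sketch of pulling a page’s recent edit history through the public MediaWiki action API. It assumes the third-party `requests` library; the article title is just an example.

```python
# Minimal sketch: pull a page's recent edit history from the public
# MediaWiki action API. Assumes the third-party 'requests' library;
# the article title is just an example.
import requests

API = "https://en.wikipedia.org/w/api.php"

def recent_revisions(title, limit=10):
    """Return (timestamp, user, edit summary) for a page's latest revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page.get("revisions", [])

for rev in recent_revisions("Encyclopedia"):
    print(rev["timestamp"], rev["user"], "-", rev.get("comment", ""))
```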

Page 5: Quality of Wikipedia as a source

On average… it’s not bad:
- In 2005: four errors per article, versus three in Britannica
- In 2011, in English, Spanish & Arabic: “…the Wikipedia articles in this sample scored higher overall than the comparison articles with respect to accuracy, references, style/readability and overall judgment…”

- Millions of articles – so many are, individually, problematic
- Various ways of identifying “signs” of quality
- Markers for quality are both obvious and subtle
- Very effective “springboard” tool

Page 6: Moving to other content

- Other languages – not translations, and may have more content (see the sketch after this list)
- Mousing over footnote markers
- Within the references:
  - links through DOIs and other identifiers
  - ISBNs go to a special landing page, and then out to libraries, booksellers, etc.
  - ISSNs go to WorldCat
- If the subject is an author, look for authority control links
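
As a concrete example of the “other languages” point, the sketch below lists the interlanguage links recorded for an article via the MediaWiki action API (`prop=langlinks`) – remembering that each is an independent article, not a translation. A minimal sketch, assuming the `requests` library; the article title is an example.

```python
# Minimal sketch: list the other-language versions of an article via
# the MediaWiki action API (prop=langlinks). Assumes 'requests';
# the article title is an example.
import requests

API = "https://en.wikipedia.org/w/api.php"

def language_versions(title):
    """Map language code -> title of the corresponding article."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])}

links = language_versions("Physics")
print(len(links), "other languages; German title:", links.get("de"))
```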

Page 7: Other research tools

- Some tools available – the “toolserver” allows live DB queries; complex to use, but rewarding
- CatScan: look for intersections of categories, e.g. “all physicists born in 1912” – 53 in English, 35 in German (a sketch of the idea follows below)
- Full dumps of all data available – http://dumps.wikipedia.org/
- Reusers – Freebase, DBpedia, Wolfram Alpha
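
The CatScan idea can be approximated with the public API’s `list=categorymembers` query: fetch two categories and intersect them. A minimal sketch, assuming `requests`; unlike the real tool it does not recurse into subcategories or follow query continuations, and the category names are examples.

```python
# Minimal sketch of a CatScan-style category intersection using
# list=categorymembers. Unlike the real tool this does not recurse
# into subcategories or follow query continuations; category names
# are examples.
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category):
    """Return the set of page titles directly in a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": "max",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return {m["title"] for m in data["query"]["categorymembers"]}

# e.g. physicists born in 1912 = members of both categories
print(sorted(category_members("English physicists") &
             category_members("1912 births")))
```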

Page 8: Wikidata

- Wikidata: our new linked data repository
  - Phase I: cross-language links
  - Phase II: structured data elements
  - Phase III: dynamic lists
- Very loosely defined schema
- Currently harvesting structured data from Wikipedia
- Public API, open to reusers (see the sketch below)
- CC0-licensed data – fully open
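
A minimal sketch of using that public API (`wbgetentities`) with `requests`; Q42 (Douglas Adams) is a real item ID, used here only as an example.

```python
# Minimal sketch against the public Wikidata API (wbgetentities).
# Q42 (Douglas Adams) is a real item ID, used here as an example.
import requests

API = "https://www.wikidata.org/w/api.php"

def get_entity(qid, lang="en"):
    """Fetch an item's labels and sitelinks from Wikidata."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|sitelinks",
        "languages": lang,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return data["entities"][qid]

entity = get_entity("Q42")
print(entity["labels"]["en"]["value"])         # item label
print(entity["sitelinks"]["enwiki"]["title"])  # linked English Wikipedia article
```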

Page 9: Research about Wikipedia

- Thriving research around Wikimedia communities & content: by mid-2011, 2,100 peer-reviewed articles and 38 PhD theses
- Active research committee and WMF support
- Regular community-produced monthly newsletter: http://enwp.org/meta:Research // @wikiresearch
- Topics include: community and content creation; reading and researching by users; quality of content; technical research; large-scale content examination

Page 10: Research on communities

Research on the Wikipedia communities:
- Dynamics of community conflict, discussions, collaboration, voting, contribution, mentoring…
- Demographics, motivation and specialisms of contributors
- Patterns of growth and content creation/deletion
- Effect of central programs on volunteer activity
- Cross-cultural interaction

Page 11: Visualisation: discussion dynamics

http://notabilia.net/

Page 12: Editor activity and motivation

http://commons.wikimedia.org/wiki/File:Effect_of_barnstars_on_productivity.png

Page 13: Research on users

Research on usage of Wikipedia:
- Specific searching behaviour
- Patterns of usage (yearly, daily) – see the sketch after this list
- Tracking external events through Wikipedia
- Search engine rankings
- Change in usage by students
- Effect of Wikipedia publication on the wider literature
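
For the “patterns of usage” point, daily view counts can be pulled programmatically. A minimal sketch, assuming `requests` and the current Wikimedia pageviews REST API, which postdates this talk (stats.grok.se was the contemporary equivalent); the title and date range are examples.

```python
# Minimal sketch: daily view counts for one article. Uses the current
# Wikimedia pageviews REST API, which postdates this talk
# (stats.grok.se was the contemporary equivalent); the title and
# date range are examples.
import requests

URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/all-agents/{title}/daily/{start}/{end}")

def daily_views(title, start, end):
    """Return (YYYYMMDDHH timestamp, view count) pairs for each day."""
    resp = requests.get(URL.format(title=title, start=start, end=end),
                        headers={"User-Agent": "research-example/0.1"})
    return [(item["timestamp"], item["views"]) for item in resp.json()["items"]]

for ts, views in daily_views("Wikipedia", "20240101", "20240107"):
    print(ts, views)
```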

Page 14: Visualising editing patterns

http://commons.wikimedia.org/wiki/File:WikiTrip_egyptian_revolution_screenshot.png

Page 15: Research on content

Research on the content of Wikipedia:
- Evolution of content
- Accuracy, coverage and quality
- Biases – geographic, cultural, gender
- Linguistic analysis
- Effect of external publications on Wikipedia

Page 16: Quality assessment comparisons

http://commons.wikimedia.org/wiki/File:Boxplot_of_Average_Article_Feedback_ratings_by_project_rated_quality.svg

Page 17: Research on technical aspects

Research on the technical side of Wikipedia:
- Extensive work on scaling open-content services
- Tools for detecting and handling vandalism (a toy sketch follows below)
- Algorithmic detection and identification of bias and spam
- Practical research on uses of wikis
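
As a toy illustration of rule-based vandalism detection, the sketch below scores the text added by an edit against a few hand-written rules. Everything here is invented for illustration; production tools (e.g. ClueBot NG) use far richer features and machine learning.

```python
# Toy illustration of rule-based vandalism scoring. All rules and
# weights are invented for illustration; production tools (e.g.
# ClueBot NG) use far richer features and machine learning.
import re

RULES = [
    (re.compile(r"[A-Z]{15,}"), 0.4),               # long runs of capitals
    (re.compile(r"(.)\1{6,}"), 0.4),                # one character repeated many times
    (re.compile(r"\b(stupid|dumb)\b", re.I), 0.3),  # crude insults
]

def vandalism_score(added_text):
    """Score the text added by an edit; higher means more suspicious."""
    return min(1.0, sum(weight for pattern, weight in RULES
                        if pattern.search(added_text)))

print(vandalism_score("bob is STUPID!!!!!!!!!!"))  # 0.7 - two rules fire
```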

Page 18: Research using content

Research using content from Wikipedia – hard to distinguish from “conventional” research, but some examples:
- Geographical analysis
- Visualisations of content
- Source for extracted datasets

…and Wikidata still to come!

Page 19: Visualising art history

http://commons.wikimedia.org/wiki/File:Wikiarthistory.png

Page 20: Visualising place

https://commons.wikimedia.org/wiki/File:Imageworld-artphp3.png