8/3/2019 Wikipedia Architecture
http://slidepdf.com/reader/full/wikipedia-architecture 1/19
WIKIPEDIA ARCHITECTURE
Matej Ferenc
WIKIPEDIA
One of the top ten websites
Currently about 400 million unique visitors a month
Over 100,000 hits per second
Performance, caching, optimization
Supported by a non-profit organization
No expensive features
Balance between performance and features
HISTORY
Phase I: UseModWiki
Experiment, launched in January 2001
Written in Perl
Pages in individual text files
No history of changes
New features: free links – linking pages with special
syntax instead of automatic links
Influenced today’s Wikipedia markup language
Wikipedia hosted on a single server
Performance issues
HISTORY
Phase II: the PHP script
Magnus Manske, January 2002
PHP + MySQL
Namespaces to organize content
Special pages: maintenance reports, contributions list, user watchlist
Frequent difficulties
Phase III: MediaWiki
Lee Daniel Crocker, July 2002
"wasn't much time to sit down and properly architect and develop a solution"
Tracking down slow functions
HISTORY
New server (still single)
Performance issues
Temporarily disabled “view count” and “site” statistics
Occasionally switched to read-only mode
In 2003, a decision: re-architect the software from scratch or continue to improve the existing code
Database server added
Caching rendered (ready-to-output) pages
New features (automatically-generated table of contents,
editing page sections)
MEDIAWIKI
Not like generic CMS
No regular features like publication workflow or ACLs
Very specific purpose
Variety of tools to handle spam and vandalism
Open-source software from the beginning
Solid external user base
PHP
MediaWiki uses unprefixed class names
Namespace class renamed to MWNamespace to be compatible with PHP 5.3
Not the best choice for performance
Developers interested in feature development
architecture is left behind
SECURITY
Wrappers around HTML output and database queries
Removes "magic quotes" slashes
Strips illegal input characters
Normalizes Unicode sequences
Cross-site request forgery (CSRF) avoided by using tokens
Cross-site scripting (XSS) avoided by validating inputs and
escaping outputs
Database functions preventing SQL injection
Unregistered users' IP addresses are logged when editing
Blocking IP addresses for vandalism
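
The three defenses listed above (tokens against CSRF, output escaping against XSS, database functions against SQL injection) can be sketched in Python. This is an illustration only, not MediaWiki's PHP code: the names `SECRET_KEY`, `make_edit_token`, `check_edit_token`, `render_comment` and `find_user`, and the `users` table, are all hypothetical.

```python
import hashlib
import hmac
import html
import secrets
import sqlite3

# Hypothetical per-site secret; a real deployment would load this from configuration.
SECRET_KEY = secrets.token_bytes(32)


def make_edit_token(session_id: str) -> str:
    """Derive a CSRF token that is bound to one user session."""
    return hmac.new(SECRET_KEY, session_id.encode("utf-8"), hashlib.sha256).hexdigest()


def check_edit_token(session_id: str, token: str) -> bool:
    """Accept a form submission only if it carries the token issued to this session."""
    return hmac.compare_digest(make_edit_token(session_id), token)


def render_comment(user_text: str) -> str:
    """Escape user input on output so any markup in it is shown, not executed."""
    return "<p>" + html.escape(user_text) + "</p>"


def find_user(conn: sqlite3.Connection, name: str):
    """Bind the value as a query parameter instead of splicing it into the SQL text."""
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchone()
```

A forged form submitted from another site cannot know the token for the victim's session, so `check_edit_token` rejects it. An input such as `alice' OR '1'='1` is treated by `find_user` as a literal name rather than as SQL.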
DATABASE
MySQL
Dozens of tables – content, users, media files, caching
Indices and summary tables used extensively
SQL queries that scan huge numbers of rows can be
very expensive
Unindexed queries are discouraged
1.4 model – content stored in two tables, cur and old.
Deleted pages in archive
1.5 model – content stored in three tables, page, revision
and text
text table – maps IDs to text blobs, each containing a few dozen revisions.
First revision stored in full, following revisions stored as diffs.
DATABASE
Revisions are grouped per page, tend to be similar, diffs are small, gzip works well, compression ratio ~ 98%
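
A small Python experiment shows why grouping the revisions of one page compresses well. It is a sketch under stated assumptions: the revisions are synthetic, the helper names (`make_revisions`, `pack`, `unpack`) are hypothetical, and it uses `zlib` (the DEFLATE algorithm that gzip also uses) on concatenated revisions rather than MediaWiki's actual storage format. The ratio it prints depends on the synthetic data and is not the figure from the slide.

```python
import random
import zlib


def make_revisions(count: int, seed: int = 0) -> list:
    """Build synthetic revisions; each one appends a sentence to the previous one."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    words = [
        "".join(rng.choice(letters) for _ in range(rng.randint(3, 8)))
        for _ in range(300)
    ]
    text = " ".join(words)
    revisions = []
    for i in range(count):
        revisions.append(text)
        text += f" Edit number {i} adds one short sentence."
    return revisions


def pack(revisions: list) -> bytes:
    """Concatenate the revisions of one page and compress them as a single blob."""
    joined = "\x00".join(revisions).encode("utf-8")
    return zlib.compress(joined, 9)


def unpack(blob: bytes) -> list:
    """Reverse pack(): decompress the blob and split it back into revisions."""
    return zlib.decompress(blob).decode("utf-8").split("\x00")


def saved_fraction(revisions: list) -> float:
    """Fraction of the raw size that compression removes (0.0 to 1.0)."""
    raw_size = sum(len(r.encode("utf-8")) for r in revisions)
    return 1.0 - len(pack(revisions)) / raw_size
```

Calling `saved_fraction(make_revisions(30))` reports how much space the blob saves. Because each revision repeats almost all of the previous one, the compressor encodes the repeats as short back-references.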
Load balancing
One master database server, any number of slaves
All writes sent to master
Reads sent to slaves
Each slave has replication lag. If it exceeds 30s, the slave will not receive read queries, so that it can catch up. If all slaves are lagged more than 30s, the system will put itself in read-only mode
Chronology protector
Write query stores master’s position in the user’s session
After the user makes a read request, load balancer reads from a slave that has caught up to the replication position
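
The load-balancing rules and the chronology protector can be sketched together in Python. This is a simplified model, not MediaWiki's implementation: `Replica` (the slide's "slave"), `LoadBalancer`, `ReadOnlyError` and the integer replication positions are hypothetical, and the sketch picks the first qualifying replica where a real balancer would also weigh server load. Returning `None` when no replica has caught up is an assumption; the caller would then have to wait or retry.

```python
from dataclasses import dataclass

MAX_LAG_SECONDS = 30  # lag threshold quoted on the slide


@dataclass
class Replica:
    name: str
    lag_seconds: float  # how far this replica is behind the master
    position: int       # replication position it has applied so far


class ReadOnlyError(Exception):
    """Every replica is too lagged, so the site should switch to read-only mode."""


@dataclass
class LoadBalancer:
    replicas: list
    master_position: int = 0

    def write(self, session: dict) -> None:
        """Apply a write on the master and store its position in the user's session."""
        self.master_position += 1
        session["master_position"] = self.master_position

    def pick_replica(self, session: dict):
        """Choose a replica for a read, honoring lag limits and the user's last write."""
        healthy = [r for r in self.replicas if r.lag_seconds <= MAX_LAG_SECONDS]
        if not healthy:
            raise ReadOnlyError("all replicas lag more than 30 seconds")
        needed = session.get("master_position", 0)
        for replica in healthy:
            if replica.position >= needed:
                return replica
        return None  # assumption: the caller waits until a replica catches up
```

A user who has just saved an edit is therefore never shown a page read from a replica that has not yet applied that edit, while users who made no writes can read from any healthy replica.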
CACHING
Most requests handled by caching proxies
(Squids)
Contain static versions of rendered pages, served for
simple reads by users who are not logged in
Logged-in users and other requests forwarded to the web server
Second level caching
When assembling the page from multiple objects
Many can be cached to minimize future calls
Page's interface – sidebar, menus, UI text
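
The two cache levels can be sketched in Python with plain dictionaries. This is a toy model with hypothetical names (`page_cache`, `object_cache`, `handle`, `render_page`, `build_sidebar`). In the real setup the first level is a separate Squid proxy in front of the web server, not a dictionary inside the application.

```python
from collections import Counter

page_cache = {}    # first level: whole rendered pages (the caching proxy's role)
object_cache = {}  # second level: reusable page fragments
stats = Counter()  # counts how often the expensive paths actually run


def build_sidebar():
    """Stand-in for an expensive fragment that is identical on every page."""
    stats["sidebar_builds"] += 1
    return "<nav>Main page | Random article</nav>"


def cached_object(key, builder):
    """Return a fragment from the object cache, building it on first use."""
    if key not in object_cache:
        object_cache[key] = builder()
    return object_cache[key]


def render_page(path, user):
    """Assemble a page from cached fragments plus per-request parts."""
    stats["page_renders"] += 1
    sidebar = cached_object("sidebar", build_sidebar)
    who = user if user else "not logged in"
    return f"<html>{sidebar}<h1>{path}</h1><p>{who}</p></html>"


def handle(method, path, user=None):
    """Serve anonymous GETs from the page cache; forward everything else."""
    cacheable = method == "GET" and user is None
    if cacheable and path in page_cache:
        stats["cache_hits"] += 1
        return page_cache[path]
    body = render_page(path, user)
    if cacheable:
        page_cache[path] = body
    return body
```

Two anonymous reads of the same path render the page once. A logged-in read bypasses the page cache but still reuses the cached sidebar, which is the point of the second level.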
LANGUAGES
Wikipedia is available in more than 280 languages
English – less than 20%
Provide localized interface
Localization and internationalization (l10n & i18n)
Central component of MediaWiki
Pervasive – impacts many parts of the software
UTF8 (since 2005)
Bidirectional text
LTR and RTL text on the same page
Interface messages
key-value pairs
Language switches ({{GENDER:}}, {{PLURAL:}}, {{GRAMMAR:}})
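
How a key-value interface message with a `{{PLURAL:}}` switch might be expanded can be sketched in Python. The sketch is deliberately simplified: the message key `edit-count`, the `MESSAGES` table and the `message` function are hypothetical, and it handles only a two-form plural (singular for exactly 1, plural otherwise). Many languages need more forms and different rules, which is why the real logic is language-dependent.

```python
import re

# Hypothetical message table: language code -> message key -> template.
MESSAGES = {
    "en": {"edit-count": "$1 {{PLURAL:$1|edit|edits}}"},
}

# Matches {{PLURAL:$1|singular|plural}} and captures the two forms.
PLURAL_RE = re.compile(r"\{\{PLURAL:\$1\|([^|}]*)\|([^|}]*)\}\}")


def message(lang: str, key: str, count: int) -> str:
    """Look up a message, resolve its PLURAL switch, then fill in the $1 parameter."""
    template = MESSAGES[lang][key]

    def choose(match):
        # Two-form rule only: singular for exactly 1, plural otherwise.
        return match.group(1) if count == 1 else match.group(2)

    text = PLURAL_RE.sub(choose, template)
    return text.replace("$1", str(count))
```

With this table, `message("en", "edit-count", 1)` gives `1 edit` and `message("en", "edit-count", 5)` gives `5 edits`. The switch is resolved before `$1` is substituted because the switch itself refers to `$1`.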
LANGUAGES
Localizing messages
Localized interface messages in MessagesXx.php
Where Xx is the ISO-639 code of the language
Message files include language-dependent information such as
date formats, number formats, grammar conventions
Special MediaWiki website translatewiki.net
Translators can localize interface messages by editing a
wiki page
MessagesXx.php files are updated in the MediaWiki code repository
Documentation for every message
MEDIA FILES
Files stored on the file system
Thumbnails in dedicated thumb directory
Supports uncommon file types
Like SVG vector images
Rendered as PNG files, can be thumbnailed and
displayed inline
File is assigned a page with information entered
by the uploader
Includes copyright information (author, license)
Describes or classifies the content of the file
CONTENT PROCESSING
User-generated content isn't in HTML, but in
a markup language specific to MediaWiki
"wikitext"
Formatting changes (bold, italic using quotes)
Links, templates, context-dependent content (date or signature)
Incredible number of other magical things
Content needs to be parsed, assembled from all the
external or dynamic pieces and converted to HTML
Parser – one of the most essential parts of MediaWiki
Difficult to change or improve
Has to remain extremely stable
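
A toy Python converter illustrates the kind of transformation involved: quote-based bold and italic, and free links in double square brackets. It is an illustration only and in no way MediaWiki's parser. The function name `wikitext_to_html` and the `/wiki/` URL prefix are assumptions, and templates, nesting and the many edge cases that make the real parser hard to change are ignored.

```python
import html
import re
from urllib.parse import quote


def _link(match):
    """Turn the captured page title of [[Title]] into an HTML anchor."""
    label = match.group(1)  # already HTML-escaped by the caller
    target = html.unescape(label).replace(" ", "_")
    return f'<a href="/wiki/{quote(target)}">{label}</a>'


def wikitext_to_html(text: str) -> str:
    """Convert a tiny subset of wikitext to HTML (bold, italic, simple links)."""
    # Escape first so raw HTML in user input is displayed rather than interpreted.
    # quote=False leaves apostrophes alone, since they carry the bold/italic markup.
    out = html.escape(text, quote=False)
    # Three apostrophes mark bold; handle them before the two-apostrophe italics.
    out = re.sub(r"'''(.+?)'''", r"<b>\1</b>", out)
    out = re.sub(r"''(.+?)''", r"<i>\1</i>", out)
    # [[Page title]] becomes a link to that page.
    out = re.sub(r"\[\[([^\]|]+)\]\]", _link, out)
    return out
```

Even this subset shows an ordering dependency (bold must be matched before italic), a small hint of why the full language is hard to describe with a formal grammar.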
MARKUP LANGUAGE
No formal specification from the beginning
Started based on UseModWiki's markup
Morphed and evolved as needs have demanded
Complex language – cannot be represented as a
formal grammar
MODIFIABILITY
Administrators can install extensions, configure
separate helper programs (image thumbnailing
and TeX rendering) and global settings
MediaWiki used to over-depend on global
variables
Slowly moving context out of global variables into
objects
Storing context in objects allows flexible reuse of objects
Some parts can be modified (database schema)
While others cannot (Parser)
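
The move from global variables to context objects can be sketched in Python. MediaWiki itself is PHP, so this only illustrates the idea; the names `current_user`, `current_language`, `RequestContext`, `greeting_global` and `greeting` are hypothetical.

```python
from dataclasses import dataclass

# Old style: functions read ambient state from module-level globals,
# so only one "current" user and language can exist at a time.
current_user = "Anonymous"
current_language = "en"


def greeting_global() -> str:
    """Depends on hidden global state; hard to reuse or test in isolation."""
    return f"[{current_language}] Welcome, {current_user}"


# New style: the same state travels in an explicit context object.
@dataclass(frozen=True)
class RequestContext:
    user: str
    language: str


def greeting(ctx: RequestContext) -> str:
    """Depends only on its argument, so several contexts can coexist."""
    return f"[{ctx.language}] Welcome, {ctx.user}"
```

With explicit contexts, the same function can serve `RequestContext("Alice", "en")` and `RequestContext("Bob", "sk")` side by side, which is not possible when the state lives in globals.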
SYSTEM ARCHITECTURE
DNS servers run PowerDNS
Geographical DNS distributes requests between two main
sites (US and Europe) depending on the location of the client
Load-balancing on servers uses LVS
For HTML – Squid caching proxy servers in front of Apache
For image files – Squid in front of Sun Java System Web Server
Servers run Ubuntu Linux
Image file storage servers run Solaris
Main web application is MediaWiki written in PHP
Structured data stored in MySQL. Wikis grouped into clusters, each served by several MySQL servers replicated in a single-master configuration.