wikipedia architecture

WIKIPEDIA ARCHITECTURE
Matej Ferenc

Upload: matejferenc4458

Post on 06-Apr-2018

Page 1: Wikipedia Architecture

8/3/2019 Wikipedia Architecture

http://slidepdf.com/reader/full/wikipedia-architecture 1/19

WIKIPEDIA  

One of the top ten websites

Currently about 400 million unique visitors a month

Over 100,000 hits per second

Performance, caching, optimization

Supported by a non-profit organization

No expensive features

Balance between performance and features

HISTORY  

Phase I: UseModWiki

Experiment, launched in January 2001

Written in Perl

Pages in individual text files

No history of changes

New features: free links – linking pages with special syntax instead of automatic links

Influenced today’s Wikipedia markup language 

Wikipedia hosted on a single server

Performance issues

HISTORY  

Phase II: the PHP script

Magnus Manske, January 2002

PHP + MySQL

Namespaces to organize content

Special pages: maintenance reports, contributions list, user watchlist

Frequent difficulties

Phase III: MediaWiki

Lee Daniel Crocker, July 2002

"wasn't much time to sit down and properly architect and develop a solution"

Tracking down slow functions

HISTORY  

New server (still single)

Performance issues

Temporarily disabled “view count” and “site” statistics 

Occasionally switched to read-only mode

In 2003, decision whether to re-architect the software from scratch or continue to improve existing code

Database server added

Caching rendered (ready-to-output) pages

New features (automatically-generated table of contents, editing page sections)

MEDIAWIKI

Not like generic CMS

No regular features like publication workflow or ACLs

 Very specific purpose

 Variety of tools to handle spam and vandalism

Open-source software from the beginning

Solid external user base

PHP

MediaWiki uses unprefixed class names

Namespace class renamed to MWNamespace to be compatible with PHP 5.3

Not the best choice for performance

Developers interested in feature development; architecture is left behind

SECURITY  

Wrappers around HTML output and database queries

Removes "magic quotes" slashes

Strips illegal input characters

Normalizes Unicode sequences

Cross-site request forgery (CSRF) avoided by using tokens

Cross-site scripting (XSS) avoided by validating inputs and escaping outputs

Database functions preventing SQL injection

Unregistered users' IP addresses are logged when editing

Blocking IP addresses for vandalism
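
The three defences named on this slide (CSRF tokens, output escaping, and database functions that prevent SQL injection) can be sketched in a few lines. This is a minimal Python illustration, not MediaWiki's PHP code: the function names, the `users` table, and the use of SQLite and HMAC are all assumptions made for the example.

```python
import hashlib
import hmac
import html
import secrets
import sqlite3


def make_csrf_token(session_secret: bytes, action: str) -> str:
    """Derive a per-session, per-action token to embed in a form (hypothetical scheme)."""
    return hmac.new(session_secret, action.encode("utf-8"), hashlib.sha256).hexdigest()


def check_csrf_token(session_secret: bytes, action: str, token: str) -> bool:
    """Compare the submitted token with the expected one in constant time."""
    expected = make_csrf_token(session_secret, action)
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))


def render_comment(user_text: str) -> str:
    """Escape user input on output so it cannot inject markup."""
    return "<p>" + html.escape(user_text) + "</p>"


def find_user(conn: sqlite3.Connection, name: str):
    """Bind the value as a parameter instead of concatenating it into the SQL string."""
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (name,)).fetchone()


# Example setup: one session secret and an in-memory database with one user.
secret = secrets.token_bytes(32)
token = make_csrf_token(secret, "edit")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))
```

A token issued for one action is rejected for another, markup in user text is rendered inert, and a classic injection string is treated as an ordinary (non-matching) user name.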

DATABASE

MySQL

Dozens of tables – content, users, media files, caching

Indices and summary tables used extensively

SQL queries that scan huge numbers of rows can be very expensive

Unindexed queries are discouraged

1.4 model – content stored in two tables, cur and old.

Deleted pages in archive

1.5 model – content stored in three tables, page, revision and text

text table – mapping IDs to text blobs, which contain a few dozen revisions. First revision stored in full, following revisions stored as diffs.
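
The blob layout described above (first revision in full, later revisions as diffs) can be sketched as follows. This is an illustrative Python version under stated assumptions: the copy/insert delta format, the JSON container, and the function names are invented for the example, and the zlib step stands in for whatever compression is applied to the blob. It is not MediaWiki's actual storage format.

```python
import difflib
import json
import zlib


def make_delta(old: str, new: str) -> list:
    """Encode `new` as copy/insert operations against `old`."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(["copy", i1, i2])
        elif tag in ("replace", "insert"):
            ops.append(["insert", new[j1:j2]])
        # "delete" needs no operation: the old span is simply not copied
    return ops


def apply_delta(old: str, ops: list) -> str:
    """Rebuild the newer text from the older text and a delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


def pack_revisions(revisions: list) -> bytes:
    """Store the first revision in full and the rest as deltas, then compress."""
    deltas = [make_delta(a, b) for a, b in zip(revisions, revisions[1:])]
    payload = json.dumps({"base": revisions[0], "deltas": deltas})
    return zlib.compress(payload.encode("utf-8"))


def unpack_revisions(blob: bytes) -> list:
    """Decompress a blob and replay the deltas to recover every revision."""
    data = json.loads(zlib.decompress(blob).decode("utf-8"))
    revisions = [data["base"]]
    for ops in data["deltas"]:
        revisions.append(apply_delta(revisions[-1], ops))
    return revisions
```

Because each revision of a page usually differs only slightly from its predecessor, the deltas stay short and the packed blob is much smaller than the revisions stored separately.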

DATABASE

Revisions are grouped per page, tend to be similar, diffs are small, gzip works well, compression ratio ~ 98%

Load balancing

One master database server, any number of slaves

 All writes sent to master

Reads sent to slaves

Each slave has replication lag. If it exceeds 30s, slave will not receive read queries to catch up. If all slaves are lagged more than 30s, system will put itself in read-only mode

Chronology protector

Write query stores master’s position in the user’s session 

 After the user makes a read request, load balancer reads from slave that has caught up to the replication position
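
The read-routing rules on this slide can be sketched as a small selection function. This is a hypothetical Python model, not MediaWiki's load balancer: the `Replica` class (a "slave" in the slides' terms), the integer replication position, and the choice to return `None` when no replica has caught up yet are assumptions made for illustration. Only the 30-second threshold comes from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_LAG_SECONDS = 30  # threshold quoted in the slides


@dataclass
class Replica:
    name: str
    lag_seconds: float  # how far behind the master this replica is
    position: int       # replication position it has applied so far


class ReadOnlyError(Exception):
    """Raised when every replica is lagged beyond the threshold."""


def pick_replica(replicas: List[Replica],
                 session_position: Optional[int] = None) -> Optional[Replica]:
    """Choose a replica for a read, honouring lag and the user's last write."""
    usable = [r for r in replicas if r.lag_seconds <= MAX_LAG_SECONDS]
    if not usable:
        # All replicas are too far behind: the site goes read-only.
        raise ReadOnlyError("all replicas lagged; switch to read-only mode")
    if session_position is not None:
        # Chronology protection: only replicas that have already applied
        # the user's own write may serve this read.
        usable = [r for r in usable if r.position >= session_position]
        if not usable:
            return None  # assumption: the caller waits and retries
    return min(usable, key=lambda r: r.lag_seconds)
```

The session position is what keeps a user from saving an edit and then seeing the old version of the page on the next request.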

CACHING

Most requests handled by caching proxies (Squids)

Contain static versions of rendered pages, served for simple reads to unlogged users

Logged-in users and other requests forwarded to the web server

Second level caching

When assembling the page from multiple objects

Many can be cached to minimize future calls

Page's interface – sidebar, menus, UI text
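
The two cache levels described on this slide can be modelled with two dictionaries. This is a rough Python sketch under stated assumptions: `page_cache` stands in for the Squid proxies, `object_cache` for the second-level cache, and all function names and cache keys are hypothetical.

```python
import html
from typing import Callable, Dict

page_cache: Dict[str, str] = {}    # stands in for the caching proxies
object_cache: Dict[str, str] = {}  # stands in for the second-level cache
render_calls = {"page": 0, "sidebar": 0}  # counters to make cache hits visible


def cached(key: str, build: Callable[[], str]) -> str:
    """Return a cached fragment, building it only on the first request."""
    if key not in object_cache:
        object_cache[key] = build()
    return object_cache[key]


def build_sidebar(lang: str) -> str:
    render_calls["sidebar"] += 1
    return f"<nav lang='{lang}'>sidebar</nav>"


def render_page(title: str, lang: str) -> str:
    """Assemble a page from a cached interface fragment and the page body."""
    render_calls["page"] += 1
    sidebar = cached(f"sidebar:{lang}", lambda: build_sidebar(lang))
    return sidebar + f"<h1>{html.escape(title)}</h1>"


def handle_request(title: str, method: str = "GET",
                   logged_in: bool = False, lang: str = "en") -> str:
    """Serve simple anonymous reads from the page cache; render everything else."""
    cacheable = method == "GET" and not logged_in
    if cacheable and title in page_cache:
        return page_cache[title]
    body = render_page(title, lang)
    if cacheable:
        page_cache[title] = body
    return body
```

A repeated anonymous read never reaches `render_page`, while a logged-in request does, but it still reuses the cached sidebar fragment.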

LANGUAGES

Wikipedia is available in more than 280 languages

English – less than 20%

Provide localized interface

Localization and internationalization (l10n & i18n)

Central component of MediaWiki

Pervasive – impacts many parts of the software

UTF8 (since 2005)

Bidirectional text

LTR and RTL text on the same page

Interface messages

key-value pairs

Language switches ({{GENDER:}}, {{PLURAL:}}, {{GRAMMAR:}})
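
Interface messages as key-value pairs with a `{{PLURAL:}}` switch can be illustrated with a toy lookup. This Python sketch is deliberately simplified: the message key `pagecount`, the message texts, the fallback to English, and the two-form plural rule are assumptions for the example. Many languages need more plural forms than this handles.

```python
import re

# Hypothetical message tables: language code -> message key -> template.
MESSAGES = {
    "en": {"pagecount": "$1 {{PLURAL:$1|page|pages}}"},
    "de": {"pagecount": "$1 {{PLURAL:$1|Seite|Seiten}}"},
}

# Matches a two-form plural switch on the first parameter only.
PLURAL_RE = re.compile(r"\{\{PLURAL:\$1\|([^|}]*)\|([^|}]*)\}\}")


def message(key: str, lang: str, count: int) -> str:
    """Look up a message, resolve its plural switch, and fill in the count."""
    table = MESSAGES.get(lang, {})
    template = table.get(key, MESSAGES["en"][key])  # fall back to English

    def choose(match: "re.Match") -> str:
        return match.group(1) if count == 1 else match.group(2)

    text = PLURAL_RE.sub(choose, template)
    return text.replace("$1", str(count))
```

For example, `message("pagecount", "en", 3)` yields `3 pages`, and a language without its own table falls back to the English text.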

LANGUAGES

Localizing messages

Localized interface messages in MessagesXx.php

Where Xx is the ISO-639 code of the language

Message files include language-dependent information such as date formats, number formats, grammar conventions

Special MediaWiki website translatewiki.net

Translators can localize interface messages by editing a wiki page

MessagesXx.php files are updated in the MediaWiki code repository

Documentation for every message 

MEDIA FILES

Files stored on the file system

Thumbnails in dedicated thumb directory

Supports uncommon file types

Like SVG vector images

Rendered as PNG files, can be thumbnailed and displayed inline

File is assigned a page with information entered by the uploader

Includes copyright information (author, license)

Describes or classifies the content of the file

CONTENT PROCESSING 

User-generated content isn't in HTML, but in a markup language specific to MediaWiki

"wikitext"

Formatting changes (bold, italic using quotes)

Links, templates, context-dependent content (date or signature)

Incredible number of other magical things

Content needs to be parsed, assembled from all the

external or dynamic pieces and converted to HTML

Parser – one of the most essential parts of MediaWiki

Difficult to change or improve

Has to remain extremely stable
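
To give a feel for the conversion step, here is a toy Python converter for three wikitext constructs: bold and italic via quotes, and free links. It is a sketch under heavy assumptions. Regular expressions, the function name, and the `/wiki/` URL prefix are illustrative choices, and templates, nesting, and every other feature are ignored. The real parser is far more involved.

```python
import html
import re


def wikitext_to_html(text: str) -> str:
    """Convert a tiny subset of wikitext (bold, italic, links) to HTML."""
    # Escape raw HTML first; quotes are left alone because they are markup here.
    out = html.escape(text, quote=False)
    # Three quotes mean bold, two mean italic; bold must be handled first.
    out = re.sub(r"'''(.+?)'''", r"<b>\1</b>", out)
    out = re.sub(r"''(.+?)''", r"<i>\1</i>", out)

    def link(match: "re.Match") -> str:
        target = match.group(1)
        label = match.group(2) or target
        href = "/wiki/" + target.replace(" ", "_")  # assumed URL scheme
        return f'<a href="{href}">{label}</a>'

    # [[Target]] or [[Target|label]]
    out = re.sub(r"\[\[([^\]|\"]+)(?:\|([^\]]+))?\]\]", link, out)
    return out
```

Even this small subset shows why the order of rules matters, which hints at why the full language is hard to change safely.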

MARKUP LANGUAGE

No formal specification from the beginning

Started based on UseModWiki's markup

Morphed and evolved as needs have demanded

Complex language – cannot be represented as a formal grammar

MODIFIABILITY  

Administrators can install extensions, configure separate helper programs (image thumbnailing and TeX rendering) and global settings

MediaWiki used to over-depend on global variables

Slowly moving context out of global variables into objects

Storing context in objects allows flexible reuse of objects

Some parts can be modified (database schema)

While others cannot (Parser)
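
The move from global variables to context objects can be illustrated generically. The Python sketch below does not use MediaWiki's actual classes: `RequestContext`, `greeting`, and the global names are hypothetical. It only shows why passing context explicitly makes code easier to reuse.

```python
from dataclasses import dataclass

# Old style: request state lives in module-level globals, so every caller
# implicitly shares (and can clobber) the same user and language.
current_user = "Anonymous"
current_lang = "en"


def greeting_global() -> str:
    return f"[{current_lang}] Hello, {current_user}"


# New style: the same state travels in an explicit, immutable object.
@dataclass(frozen=True)
class RequestContext:
    user: str
    lang: str


def greeting(ctx: RequestContext) -> str:
    return f"[{ctx.lang}] Hello, {ctx.user}"


# Two independent contexts can coexist, which globals cannot offer.
alice = RequestContext(user="Alice", lang="en")
bob = RequestContext(user="Bob", lang="sk")
```

With context objects, the same function can serve several requests or tests side by side without resetting shared state between them.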

SYSTEM ARCHITECTURE

DNS servers run PowerDNS

Geographical DNS distributes requests between two main sites (US and Europe) depending on the location of the client

Load-balancing on servers uses LVS

For HTML – Squid caching proxy servers in front of Apache

For image files – Squid in front of Sun Java System Web Server

Servers run Ubuntu Linux

Image file storage servers run Solaris

Main web application is MediaWiki written in PHP

Structured data stored in MySQL. Wikis grouped into clusters, each served by several MySQL servers replicated in a single-master configuration.
