Post on 28-Dec-2015

Efficient and flexible text manipulation, spelling correction and page collections with Pywikibot

User:Bináris

Hungarian Wikipedia & Pywikipedia developer team

Wikimania 2012

From Budapest

Useful links

[[meta:User:Bináris]]

Just check it now on your laptop to follow me

What is this about?

My spellchecker underlined "occurence".

• Wiktionary: occurence (noun): 1. Common misspelling of occurrence.

• A search in English Wikipedia: "Results 1–20 of 333,623 for occurence". Does this include every erroneous form?

We speak about

• Pywikipedia bot framework

• replace.py

• fixes.py

This works on every MediaWiki installation!

Some ideas

• Spelling corrections

• Linking and unlinking

• Mass change of section titles

• Execution of naming conventions

• Replacing templates

• Replacing template parameters

• Placing templates

• Correcting link errors

Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?

The two-pass model of replacement

1. Gathering candidates (possible to-be-replaced texts) to a file: -save / -savenew. Relatively slow and automatic.

– Optionally uploading the list to your wiki (line numbers help to clean)

2. Making the actual replacements. Faster (or very fast) and attended.
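The two passes can be pictured with a minimal, self-contained Python sketch. It does not use Pywikibot: the `pages` dictionary stands in for the wiki, and `collect_candidates` and `apply_replacements` are hypothetical helper names, not functions of replace.py.

```python
import re

# Stand-in for a wiki: title -> wikitext (made-up sample data).
pages = {
    "Alpha": "The first occurence of the word.",
    "Beta": "Nothing to fix here.",
    "Gamma": "Another occurence, and one more occurence.",
}

PATTERN = re.compile(r"\boccurence\b")


def collect_candidates(pages, pattern):
    """Pass 1 (slow, unattended): return the titles whose text matches."""
    return [title for title, text in pages.items() if pattern.search(text)]


def apply_replacements(pages, titles, pattern, replacement):
    """Pass 2 (fast, attended): visit only the titles collected before."""
    for title in titles:
        pages[title] = pattern.sub(replacement, pages[title])


candidates = collect_candidates(pages, PATTERN)  # in real life: saved to a file
apply_replacements(pages, candidates, PATTERN, "occurrence")
print(candidates)
```

In real use the candidate list would be written to a file between the two passes, so the second pass can be repeated or interrupted without searching again.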

Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?

What is a fix?

• A fix contains a replacement task.

• See the links on my Meta page for description & examples
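As a rough illustration, a fix is a named dictionary entry that holds the replacement pairs and, optionally, exceptions. The sketch below shows the general shape and applies it with plain `re`. The fix name `occurrence-demo` and the helper `apply_fix` are made up for this example, and the real replace.py does much more (exceptions, page selection, saving).

```python
import re

fixes = {}

# Hypothetical fix; only the overall layout is meant to resemble fixes.py.
fixes["occurrence-demo"] = {
    "regex": True,
    "msg": {"en": "Bot: spelling correction"},
    "replacements": [
        (r"\b([Oo])ccurence(s?)\b", r"\1ccurrence\2"),
    ],
}


def apply_fix(text, fix):
    """Apply every (old, new) pair of a fix to the text, in order."""
    for old, new in fix["replacements"]:
        text = re.sub(old, new, text)
    return text


print(apply_fix("Occurence and occurences.", fixes["occurrence-demo"]))
```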

The magic of regular expressions

Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?

Regular expressions

• color → colour: this is concrete and accidental (and uninteresting :-P)

• What about changing [[január 4]]. to [[január 4.]] and [[január 4]]-én to [[január 4.|január 4]]-én? (For all dates, of course)

• Or July 13, 2012 and 13 July 2012 to 2012-07-13, and 7/13/2012 to 2012-07-13 (ISO 8601) within tables?

• Or color, Color, c/Colorful, c/Colorfulness to colour… (but not Colorado and colorectal cancer)?

Note! Colorful (film) and (manga) and CSS colors go to exceptions! (Why? Sure? How to decide?)
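The colour case could be sketched with a negative lookahead, as below. This is only an illustration under simple assumptions: the two excluded endings are the ones named on the slide, the helper name `to_british` is made up, and a real fix would still need exceptions for titles and CSS.

```python
import re

# "color" at the start of a word, any ending, but not Colorado / colorectal.
COLOR = re.compile(r"\b([Cc])olor(?!ado\b|ectal\b)(\w*)")


def to_british(text):
    """Replace color... with colour..., keeping the case of the first letter."""
    return COLOR.sub(r"\1olour\2", text)


sample = "Color, colorful and Colorfulness in Colorado; colorectal cancer."
print(to_british(sample))
```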

Regular expressions

• Regular expressions form a simple programming language that searches for patterns and replaces with patterns.

• Learn them, they are worth it! Another dimension of efficiency.

Example: search for a date

July 13, 2012 (a regex-like analysis)

1. A month name (possibly in lower case or abbreviated as Jul)

2. Zero, one or more spaces

3. 1…9 OR 0 followed by 1…9 OR 1 or 2 followed by 0…9 OR 3 followed by 0 or 1

4. Comma?

5. Zero, one or more spaces (but at least one if there is no comma)

6. At most four digits (are 1- and 2-digit years worth matching?)
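The six steps above can be written down as one Python regular expression. This is a sketch of one possible reading of the analysis, not the pattern used in any actual fix; the names `MONTHS` and `DATE` are chosen for this example only.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Step 1: full name or three-letter abbreviation (case handled by the flag).
month = "|".join(f"{m}|{m[:3]}" for m in MONTHS)

DATE = re.compile(
    rf"\b(?P<month>{month})"             # 1. month name or abbreviation
    r"\s*"                               # 2. zero or more spaces
    r"(?P<day>[12][0-9]|3[01]|0?[1-9])"  # 3. day 1..31, optional leading 0
    r"(?:,\s*|\s+)"                      # 4-5. comma (spaces optional) or spaces
    r"(?P<year>\d{1,4})\b",              # 6. at most four digits
    re.IGNORECASE,
)

for candidate in ["July 13, 2012", "july 13,2012", "Jul 13, 2012", "July 132012"]:
    print(candidate, "->", bool(DATE.search(candidate)))
```

The last candidate is rejected because there is neither a comma nor a space between the day and the year.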

First theorem

The more hits and the more precise matching you want, the more complex the regex will be.

(Do you want to find july? Do you want to find July 13,2012? Do you want to find Jul 13, 2012?)

Example: agents (search & replace)

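'replacements': [
    # Plain agency name followed by a separate word "ügynök"
    (r'(FBI|CIA|KGB|MI ?\d) [üÜ]gynök(?!e)', r'\1-ügynök'),
    # Linked agency: the closing ]] belongs to every alternative
    (r'((?:FBI|CIA|KGB|MI ?\d)\]\]) [üÜ]gynök(?!e)', r'\1-ügynök'),
],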

1. An agency (MI followed by an optional space and a digit)

2. A space

3. Ügynök OR ügynök, but NOT ügynöke (hyphen prohibited)

Second line: a linked agency

Result: a hyphenated, lower case agent (=ügynök in Hungarian)

NB it was preceded by some searches! Not all agencies are here.
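To try the agent patterns outside the bot, they can be fed to plain `re.sub` in Python 3. In this sketch the second pattern groups the alternatives so that the closing ]] applies to every agency, which is an assumption about the intent; `hyphenate_agents` is a hypothetical helper name.

```python
import re

replacements = [
    # Plain agency name, a space, then Ügynök/ügynök (but not ügynöke)
    (r"(FBI|CIA|KGB|MI ?\d) [üÜ]gynök(?!e)", r"\1-ügynök"),
    # Linked agency: the closing ]] is kept inside group 1
    (r"((?:FBI|CIA|KGB|MI ?\d)\]\]) [üÜ]gynök(?!e)", r"\1-ügynök"),
]


def hyphenate_agents(text):
    """Apply both replacements in order."""
    for old, new in replacements:
        text = re.sub(old, new, text)
    return text


for sample in ["FBI Ügynök", "[[CIA]] ügynök", "MI5 ügynök", "FBI ügynöke"]:
    print(sample, "->", hyphenate_agents(sample))
```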

Example: exceptions with regexes

BaseExceptions = {
    'inside-tags': [
        'hyperlink',
        'interwiki',
    ],
    'text-contains': [
        r'(?i)(\{\{szinnyei|\{\{pallas\}|\{\{fényes\}|\{\{vályi\}|Vályi András|Fényes Elek|\{\{sicc\})',
    ],
    'inside': [
        r'\{\{DEFAULTSORT:.*?\}\}',  # DEFAULTSORT intentionally contains words without diacritics.
        r'<ref name.*?>',
        # All kinds of citation templates:
        r'(?is)\{\{cite.*?\}\}',  # All the cite-something templates (there is not always a space)
        r'(?is)\{\{cit(lib|per).*?\}\}',  # CitLib and CitPer (a space is not guaranteed, it may be a |)
        r'(?is)\{\{citation .*?\}\}',
    ],
    'title': [
        r'\d{4} a jogalkotásban',
    ],
}
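How such exceptions take effect can be approximated in a few lines of plain Python. This is a simplified sketch, not Pywikibot's own implementation: `replace_except` and `page_is_excepted` are hypothetical names, 'inside' is treated as "skip matches that overlap a protected region", and 'title' / 'text-contains' as "skip the whole page".

```python
import re


def page_is_excepted(title, text, exceptions):
    """'title' and 'text-contains' exceptions rule out the whole page."""
    if any(re.search(p, title) for p in exceptions.get("title", [])):
        return True
    return any(re.search(p, text) for p in exceptions.get("text-contains", []))


def replace_except(text, old, new, inside):
    """Replace old with new, except where a match overlaps an 'inside' region."""
    protected = [m.span() for pat in inside for m in re.finditer(pat, text)]

    def substitute(match):
        start, end = match.span()
        for p_start, p_end in protected:
            if start < p_end and end > p_start:  # overlaps a protected region
                return match.group(0)            # leave the text untouched
        return match.expand(new)

    return re.sub(old, substitute, text)


exceptions = {
    "title": [r"\d{4} a jogalkotásban"],
    "text-contains": [r"(?i)\{\{szinnyei"],
    "inside": [r"\{\{DEFAULTSORT:.*?\}\}", r"<ref name.*?>"],
}

text = "{{DEFAULTSORT:Argentina}} Argentina is here. <ref name=Argentina> x"
print(replace_except(text, r"Argentina", "Argentína", exceptions["inside"]))
```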

What is to be excepted?

• Keywords

Advanced level

• Fixes and functions – own Python functions
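The idea of using an own Python function as the replacement can be tried with plain `re`, which accepts a callable instead of a replacement string. The sketch below converts one US date format to ISO 8601; `US_DATE` and `to_iso` are names invented for this example, and only full English month names are handled.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June",
     "July", "August", "September", "October", "November", "December"],
    start=1)}

US_DATE = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")


def to_iso(match):
    """Called once per match; the return value replaces the matched text."""
    month, day, year = match.group(1), int(match.group(2)), match.group(3)
    return f"{year}-{MONTHS[month]:02d}-{day:02d}"


print(US_DATE.sub(to_iso, "Opened on July 13, 2012 in Washington."))
```

A function can do things a replacement pattern cannot, such as looking up the month number here.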

Workflow

Simple replacement tasks

• Find an idea

• Create the replacement

• Find a good selector (search*, category…)

• Do the work with two fingers (y/enter, then /enter) (asynchronous save!)

• Imagine that this slide and the next one form a flowchart.

*Unfortunately, there are no regexes in the MediaWiki search engine

Advanced replacement tasks

• Find an idea

• Create the first version of the replacement

• Test it as usual in software development

– Watch it working during collection

– Create a test page with purposeful errors

– Take care of [[link]]ed & [[link|piped]] versions!

• Found false positives? Missing replacements? Is it too slow? Are the previous problems solved as far as possible? Refine your regexes and/or exceptions

• Press Ctrl+C, and da capo al fine

• If the fix is good enough, begin the work.

• Maintain fixes & exceptions continuously
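A "test page with purposeful errors" can also be imitated offline with a small table of inputs and expected outputs. The sketch below uses the Hungarian date link example for a single month only; the two patterns and the helper `run` are assumptions made for this illustration, not the author's actual fix.

```python
import re

replacements = [
    # [[január 4]].   -> [[január 4.]]
    (r"\[\[(január \d{1,2})\]\]\.", r"[[\1.]]"),
    # [[január 4]]-én -> [[január 4.|január 4]]-én
    (r"\[\[(január \d{1,2})\]\]-", r"[[\1.|\1]]-"),
]


def run(text):
    """Apply the replacements in order, as the bot would."""
    for old, new in replacements:
        text = re.sub(old, new, text)
    return text


# Purposeful errors plus texts that must stay untouched.
test_cases = [
    ("[[január 4]].", "[[január 4.]]"),
    ("[[január 4]]-én", "[[január 4.|január 4]]-én"),
    ("[[január 4.]]", "[[január 4.]]"),                          # already correct
    ("[[január 4.|január 4]]-én", "[[január 4.|január 4]]-én"),  # already piped
]

failures = [(src, run(src)) for src, expected in test_cases if run(src) != expected]
print(failures or "all test cases pass")
```

Each refinement of the regexes can be rerun against the same table before touching the wiki.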

Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?

Why manually?

• Color as CSS property

• % next to a number – may be an operation

• Misspelled word – may be an example in a linguistic article or a quotation

• RESPONSIBILITY! RESPONSIBILITY!

Second theorem

Spelling corrections must be done manually.

Period.

Semiautomatic running

• Ingredients:

– A replacement task that runs almost always correctly

– One or more pizzas (depending on running time) (possibly a bottle of beer, if you like it)

– Your favourite music

– Stable knowledge of where your Pause button is

Errors

• False positives

• Conflicts (originated from false positives)

• Missed matches

• Simply bad replacement expression

• Slow fix

• Inappropriate automatic running

• Unnecessary changing because of fatigue

• Unnecessary changing because of incompetence

Change the bot owner!

Third theorem

The more hits you want, the more conflicts you get.

This is the game.

Find the balance.

Speed

Speed

• Complex fixes may run slower

• Exceptions make it slower

• Lookbehinds make it slower

• Recursive run and allowoverlap are definitely slow (risk of infinite loop!)

• A fix will be slow if the beginning of the expression has many more hits than the trailing part (see examples in fixes.py)

Speed

Fast replacements take the titles from

• -search

• -cat & al

• -links

• -transcludes

• -file

etc.

The two-pass model of replacement

1. Gathering candidates (possible to-be-replaced texts) to a file: -save / -savenew. Relatively slow and automatic.

– Optionally uploading the list to your wiki

2. Making the actual replacements. Faster (or very fast) and attended.

Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?

Efficiency

What does it mean?

• Find as many occurrences as possible (even if agglutinated)

• Find as few false positives as possible

• Face as few correction conflicts as possible

• Give the appropriate replacement always

• Let the bot work quickly — don’t wait in front of the screen

Keys to efficiency

• If you find a very efficient replacement (close to 100%), run it separately before the others in the same package: you will have fewer conflicts (but you may collect the candidates together)

• Packages that are too big may run slowly and have a greater chance of causing correction conflicts. Sometimes it is worth splitting them into smaller parts.

• Packages that are too small will use more dead time during preparation and execution. Sometimes it is worth merging them.

• How to decide then? Just watch.

Keys to efficiency

• Use exceptions when appropriate. They will decrease false positives as well as correction conflicts. E.g.

– Cite book, cite web, cite anything templates

– URLs, image names (even as template parameters and gallery images!)

– Templates marking pages out of your scope (old authors in Hungarian Wikipedia whose quotations contain old-style spelling)

– Titles marking pages out of your scope (year numbers in law in Hungarian Wikipedia)

• …and first of all: improve your regexes continuously!

Keys to efficiency

• Once you have found a false positive, save it for later use! -saveexc / -saveexcnew

• Then insert these titles into your exceptions.

• Run searches before/during creation of a fix.

• Don’t deal with tasks that are not worth a bot!

• Use the two-pass model and the dump whenever possible!

An ugly example

I have a fix to correct short and long i (i/í).

Argentína has an í, but the word often occurs in English and Spanish titles, so no regex can decide it and title exceptions must be used. This calls for a separate fix.

But they may be collected together.

A less ugly example

• replace.py ásnéven "ás néven" -search:másnéven -ns:0 -summary:"Helyesírás javítása kézi botszerkesztéssel: más néven"

live demo

Character encoding problems

• Keep your files in UTF-8, and don’t use Windows Notepad

• E.g. setting in Notepad++:

Character encoding problems

• If it doesn’t work in command line, write a fix

• If you can’t solve it with a fix, use URL encoding

– replace.py -catr:Венгрия . @ -lang:ru -excepttext:"[[hu:" -save:magyarok.txt -always

– replace.py -catr:%D0%92%D0%B5%D0%BD%D0%B3%D1%80%D0%B8%D1%8F . @ -lang:ru -excepttext:"[[hu:" -save:magyarok.txt -always

live demo

• You may store this in a script (import replace.py)
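The URL-encoded form of a category name does not have to be typed by hand. A short sketch with Python's standard library (the variable names are chosen for this example):

```python
from urllib.parse import quote, unquote

category = "Венгрия"
encoded = quote(category)  # percent-encodes the UTF-8 bytes by default
print(encoded)
print(unquote(encoded) == category)
```

The printed value can be pasted after -catr: when the console cannot handle the original characters.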

This is the way of page collections

Page collections

The two-pass model of replacement

1. Gathering candidates (possible to-be-replaced texts) to a file: -save / -savenew. Relatively slow and automatic.

– Optionally uploading the list to your wiki

2. Making the actual replacements. Faster (or very fast) and attended.

A simple idea

1. Gathering candidates (possible to-be-replaced texts) to a file. Relatively slow and automatic.

– Uploading the list to your wiki (this is the result!)

2. Nothing. You are ready.

Some ideas for page collections

• Scheme: some existing/missing text

• Articles related to Hungary in other Wikipedias (see above for ruwiki)

• The Redlist Project for animals and plants

• Articles with {{commons}} template, but without any image

• …let your imagination run free!
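The "some existing/missing text" scheme can be sketched as a simple filter. The example below looks for pages that contain a {{commons}} template but no image; the `pages` data, the helper `wanted` and both patterns are illustrative assumptions (for instance, localized namespace names for files are not covered).

```python
import re

HAS_COMMONS = re.compile(r"\{\{\s*commons\b", re.IGNORECASE)      # existing text
HAS_IMAGE = re.compile(r"\[\[\s*(?:File|Image)\s*:", re.IGNORECASE)  # missing text


def wanted(text):
    """True if the page has the template but no embedded image."""
    return bool(HAS_COMMONS.search(text)) and not HAS_IMAGE.search(text)


# Stand-in for the wiki: title -> wikitext (made-up sample data).
pages = {
    "A": "{{commons|A}} Some text without pictures.",
    "B": "{{Commons|B}} [[File:B.jpg|thumb]] Some text.",
    "C": "Plain text only.",
}

collection = sorted(title for title, text in pages.items() if wanted(text))
print(collection)
```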

Useful links

[[meta:User:Bináris]]

Thank you for your attention!

PS – some thoughts months later

• Lookahead is faster than recursion or overlapping.

• If a function is called for each match, that makes the bot run really slowly.

• In such cases a separate "fellow fix" without a function call is useful for faster searching.
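Why a lookahead can replace overlapping or recursive runs is easy to see on a toy example in plain Python (the sample text is made up): when the pattern consumes the character that the next match would need, a single pass misses every second hit, while a lookahead only checks that character and leaves it available.

```python
import re

text = "1 2 3 4"

# Both digits are consumed: matches cannot overlap, so a gap is left behind
# and a second (recursive) run would be needed.
consuming = re.sub(r"(\d) (\d)", r"\1\2", text)

# The following digit is only looked at, not consumed: one pass is enough.
lookahead = re.sub(r"(\d) (?=\d)", r"\1", text)

print(consuming)
print(lookahead)
```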
