Efficient and flexible text manipulation, spelling correction and page collections with Pywikibot. User:Bináris, Hungarian Wikipedia & Pywikipedia developer team. Wikimania 2012. From Budapest

Upload: aubrey-newman

Post on 28-Dec-2015


Page 1

Efficient and flexible text manipulation, spelling correction and page collections with Pywikibot

User:Bináris

Hungarian Wikipedia & Pywikipedia developer team

Wikimania 2012

From Budapest

Page 2

Useful links

[[meta:User:Bináris]]

Just check it now on your laptop to follow me

Page 3

What is this about?

My spellchecker underlined "occurence".

• Wiktionary: occurence (noun): 1. Common misspelling of occurrence.

• A search in English Wikipedia: "Results 1–20 of 333,623 for occurence". Does this include every erroneous form?

Page 4

We speak about

• Pywikipedia bot framework

• replace.py

• fixes.py

This works on every MediaWiki installation!

Page 5

Some ideas

• Spelling corrections

• Linking and unlinking

• Mass change of section titles

• Enforcing naming conventions

• Replacing templates

• Replacing template parameters

• Placing templates

• Correcting link errors

Page 6

Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?

Page 7

The two-pass model of replacement

1. Gathering candidates (texts that may need replacing) into a file: -save / -savenew. Relatively slow and automatic.

– Optionally uploading the list to your wiki (line numbers help to clean it up)

2. Making the actual replacements. Faster (or very fast) and attended.
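The two passes can be sketched in plain Python, independently of the bot framework. This is only an illustration of the idea: the helper names `gather_candidates` and `read_candidates`, the `(title, text)` input pairs and the one-`[[title]]`-per-line file layout are assumptions made for the example, not the behaviour of replace.py itself.

```python
import re


def gather_candidates(pages, pattern, path):
    """First pass: save the titles of pages whose text matches `pattern`.

    `pages` is any iterable of (title, text) pairs (hypothetical input).
    Returns the number of titles written.
    """
    regex = re.compile(pattern)
    saved = 0
    with open(path, "w", encoding="utf-8") as out:
        for title, text in pages:
            if regex.search(text):
                out.write(f"[[{title}]]\n")
                saved += 1
    return saved


def read_candidates(path):
    """Second pass: read the saved titles back for the attended run."""
    with open(path, encoding="utf-8") as saved:
        return [line.strip()[2:-2] for line in saved if line.strip()]
```

The slow scan runs unattended and only writes a list; the attended run later works through that much shorter list.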

Page 8

Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?

Page 9

What is a fix?

• A fix contains a replacement task.

• See the links on my Meta page for description & examples
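As a rough illustration of what such a fix looks like, the sketch below builds one entry in the dictionary layout used by fixes.py (`regex`, `msg`, `replacements`, `exceptions`). The fix name `occurrence`, the edit summary and the `preview` helper are made up for this example; in a real installation the `fixes` dictionary already exists and the bot applies the fix, not a helper like this.

```python
import re

# In fixes.py / user-fixes.py this dictionary is already defined;
# it is created here only so that the sketch is self-contained.
fixes = {}

fixes['occurrence'] = {  # hypothetical fix name
    'regex': True,
    'msg': {
        'en': 'Bot: correcting the spelling of "occurrence"',  # made-up summary
    },
    'replacements': [
        (r'\b([Oo])ccurence(s?)\b', r'\1ccurrence\2'),
    ],
    'exceptions': {
        'inside-tags': ['hyperlink', 'interwiki'],
    },
}


def preview(fix, text):
    """Apply the replacements of a fix to a string, ignoring exceptions."""
    for old, new in fix['replacements']:
        text = re.sub(old, new, text)
    return text
```

`preview` is handy for trying a pattern on a test string before letting the bot touch any page.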

Page 10

The magic of regular expressions

Page 11

Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?

Page 12

Regular expressions

• color → colour: this is concrete and accidental (and uninteresting :-P)

• What about changing [[január 4]]. to [[január 4.]] and [[január 4]]-én to [[január 4.|január 4]]-én? (For all dates, of course)

• Or July 13, 2012 and 13 July 2012 to 2012-07-13, and 7/13/2012 to 2012-07-13 (ISO 8601) within tables?

• Or color, Color, c/Colorful, c/Colorfulness to colour… (but not Colorado and colorectal cancer)? Note! Colorful (film) and (manga) and CSS colors go to exceptions! (Why? Sure? How to decide?)
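Two of these tasks can be sketched with Python's `re` module. The patterns below are one possible reading of the slide, not the fixes actually used on Hungarian Wikipedia; the names `HU_MONTHS`, `DATE_LINK_FIXES`, `COLOUR_FIX` and `apply_all` are invented for the example, and exceptions (CSS, film titles) are deliberately left out.

```python
import re

# Hungarian month names, used inside the link patterns below.
HU_MONTHS = ("január|február|március|április|május|június|"
             "július|augusztus|szeptember|október|november|december")

DATE_LINK_FIXES = [
    # [[január 4]].  ->  [[január 4.]]
    (rf'\[\[({HU_MONTHS}) (\d{{1,2}})\]\]\.', r'[[\1 \2.]]'),
    # [[január 4]]-én  ->  [[január 4.|január 4]]-én
    (rf'\[\[({HU_MONTHS}) (\d{{1,2}})\]\](?=-)', r'[[\1 \2.|\1 \2]]'),
]

# color, Color, colors, colorful, colorfulness, but not Colorado or colorectal:
# the trailing \b rejects words that merely start with "color".
COLOUR_FIX = (r'\b([Cc])olor(s|ful|fulness)?\b', r'\1olour\2')


def apply_all(text, replacements):
    """Apply (pattern, replacement) pairs one after another."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text
```

Running `apply_all("Color, colorful, Colorado, colorectal", [COLOUR_FIX])` leaves the last two words alone, which is the precision the slide asks for.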

Page 13

Regular expressions

• Regular expressions form a simple programming language that searches for patterns and replaces them with patterns.

• Learn them, they are worth it! Another dimension of efficiency.

Page 14

Example: search for a date

July 13, 2012 (a regex-like analysis)

1. A month name (possibly in lower case or abbreviated as Jul)

2. Zero or more spaces

3. 1…9 OR 0 followed by 1…9 OR 1 or 2 followed by 0…9 OR 3 followed by 0 or 1

4. Comma?

5. Zero or more spaces (but at least one if there is no comma)

6. At most four digits (are one- and two-digit years worth matching?)
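The six steps translate almost line by line into a Python regular expression. The sketch below is one possible translation, assuming English month names and three-letter abbreviations; `MONTH_NAMES`, `DATE_RE` and `to_iso` are names chosen for this example. It also converts each hit to ISO 8601 by passing a function as the replacement.

```python
import re

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]

# "jan(?:uary)?|feb(?:ruary)?|...": full name or three-letter abbreviation.
MONTH = "|".join(
    name[:3] + (f"(?:{name[3:]})?" if name[3:] else "") for name in MONTH_NAMES
)

DATE_RE = re.compile(
    rf"\b(?P<month>{MONTH})"              # 1. month name, any case (see flag)
    r"\s*"                                # 2. zero or more spaces
    r"(?P<day>3[01]|[12][0-9]|0?[1-9])"   # 3. day 1..31, optional leading 0
    r"(?:\s*,\s*|\s+)"                    # 4-5. comma, or at least one space
    r"(?P<year>\d{1,4})(?!\d)",           # 6. at most four digits
    re.IGNORECASE,
)


def to_iso(match):
    """Replacement function: turn a matched date into YYYY-MM-DD."""
    abbreviations = [name[:3] for name in MONTH_NAMES]
    month = abbreviations.index(match.group("month")[:3].lower()) + 1
    return f"{int(match.group('year')):04d}-{month:02d}-{int(match.group('day')):02d}"
```

For example, `DATE_RE.sub(to_iso, text)` rewrites every matched date in `text`. Because the month is matched case-insensitively, the modal verb "may" followed by numbers would also match, which is exactly the kind of false positive that calls for attended running.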

Page 15

First theorem

The more hits and the more precise matching you want, the more complex the regex will be.

(Do you want to find july? Do you want to find July 13,2012? Do you want to find Jul 13, 2012?)

Page 16

Example: agents (search & replace)

'replacements': [
    # plain agency name, e.g. "FBI ügynök"
    (r'(FBI|CIA|KGB|MI ?\d) [üÜ]gynök(?!e)', r'\1-ügynök'),
    # linked agency name, e.g. "[[FBI]] ügynök"
    (r'((?:FBI|CIA|KGB|MI ?\d)\]\]) [üÜ]gynök(?!e)', r'\1-ügynök'),
],

1. An agency (MI followed by an optional space and a digit)

2. A space

3. Ügynök OR ügynök, but NOT ügynöke (hyphen prohibited)

Second line: a linked agency

Result: a hyphenated, lower case agent (=ügynök in Hungarian)

NB: it was preceded by some searches! Not all agencies are here.
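The effect of the two patterns can be tried outside the bot with plain `re.sub`. This is a sketch only: the second pattern is written here with a grouped alternation so that `]]` applies to every agency, which is an assumption about the intent of the slide, and `apply_replacements` is a helper invented for the example.

```python
import re

REPLACEMENTS = [
    # plain agency name followed by "ügynök" / "Ügynök", but not "ügynöke"
    (r'(FBI|CIA|KGB|MI ?\d) [üÜ]gynök(?!e)', r'\1-ügynök'),
    # the same after a closing link bracket, e.g. "[[CIA]] ügynök"
    (r'((?:FBI|CIA|KGB|MI ?\d)\]\]) [üÜ]gynök(?!e)', r'\1-ügynök'),
]


def apply_replacements(text):
    """Apply both agent patterns in order and return the new text."""
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text
```

The negative lookahead `(?!e)` is what leaves the possessive form "MI6 ügynöke" untouched.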

Page 17

Example: exceptions with regexes

BaseExceptions = {
    'inside-tags': [
        'hyperlink',
        'interwiki',
    ],
    'text-contains': [
        r'(?i)(\{\{szinnyei|\{\{pallas\}|\{\{fényes\}|\{\{vályi\}|Vályi András|Fényes Elek|\{\{sicc\})',
    ],
    'inside': [
        r'\{\{DEFAULTSORT:.*?\}\}',  # DEFAULTSORT intentionally contains words without diacritics.
        r'<ref name.*?>',
        # All kinds of citation templates:
        r'(?is)\{\{cite.*?\}\}',  # every cite-something template (there is not always a space)
        r'(?is)\{\{cit(lib|per).*?\}\}',  # CitLib and CitPer (the space is not guaranteed, may be |)
        r'(?is)\{\{citation .*?\}\}',
    ],
    'title': [
        r'\d{4} a jogalkotásban',
    ],
}
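To make the idea of an 'inside' exception concrete, here is a minimal sketch of a replacement that skips every match overlapping an excepted region. It is not the framework's own implementation, only an illustration of the principle; `replace_outside` and its parameters are names invented for this example.

```python
import re


def replace_outside(text, pattern, replacement, inside_exceptions):
    """Replace `pattern` everywhere except inside the excepted regions."""
    # Collect the (start, end) spans of all protected regions first.
    protected = [
        found.span()
        for exception in inside_exceptions
        for found in re.finditer(exception, text)
    ]

    def substitute(match):
        start, end = match.span()
        # Leave the match untouched if it overlaps any protected span.
        if any(start < p_end and end > p_start for p_start, p_end in protected):
            return match.group(0)
        return match.expand(replacement)

    return re.sub(pattern, substitute, text)
```

With the cite-template pattern from the slide as the only exception, a "color" inside `{{cite web|…}}` stays as it is while the same word in running text is changed.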

Page 18

What is to be excepted?

• Keywords

Page 19

Advanced level

• Fixes and functions – own Python functions

Page 20

Workflow

Page 21

Simple replacement tasks

• Find an idea

• Create the replacement

• Find a good selector (search*, category…)

• Do the work with two fingers (y/enter, then /enter) (asynchronous save!)

• Imagine that this slide and the next one are a flowchart.

*Unfortunately, there are no regexes in the MediaWiki search engine

Page 22

Advanced replacement tasks

• Find an idea

• Create the first version of the replacement

• Test it as usual in software development

– Watch it working during collection

– Create a test page with purposeful errors

– Take care of [[link]]ed & [[link|piped]] versions!

• Found false positives? Missing replacements? Is it too slow? Are the previous problems solved as far as possible? Refine your regexes and/or exceptions

• Press Ctrl+C, and da capo al fine

• If the fix is good enough, begin the work.

• Maintain fixes & exceptions continuously

Page 23

Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?

Page 24

Why manually?

• Color as CSS property

• % next to a number – may be an operation

• Misspelled word – may be an example in a linguistic article or a quotation

• RESPONSIBILITY!

Page 25

Second theorem

Spelling corrections must be done manually.

Period.

Page 26

Semiautomatic running

• Ingredients:

– A replacement task that runs almost always correctly

– One or more pizzas (depending on running time) (possibly a bottle of beer, if you like it)

– Your favourite music

– Stable knowledge of where your Pause button is

Page 27

Errors

• False positives

• Conflicts (originating from false positives)

• Missed matches

• Simply bad replacement expression

• Slow fix

• Inappropriate automatic running

• Unnecessary changes because of fatigue

• Unnecessary changes because of incompetence

Change the bot owner!

Page 28

Third theorem

The more hits you want, the more conflicts you get.

This is the game.

Find the balance.

Page 29

Speed

Page 30

Speed

• Complex fixes may run slower

• Exceptions make it slower

• Lookbehinds make it slower

• Recursive run and allowoverlap are definitely slow (risk of infinite loop!)

• A fix will be slow if the beginning of the expression has many more hits than its end (see examples in fixes.py)
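Whether a given pattern is slow is best measured rather than guessed. The sketch below times two patterns that find the same hit: one begins with `\w+`, which can start a match at almost every word, the other begins with rare literal text. The sample text, the pattern names and the `best_time` helper are all made up for this illustration, and the timings depend on the machine, so no result is quoted here.

```python
import re
import timeit

# Synthetic page text: many ordinary words, one real hit at the very end.
TEXT = "lorem ipsum dolor sit amet " * 2000 + "FBI ügynök"

# Begins with \w+, so the engine tries a match at nearly every position.
BROAD_START = re.compile(r'\w+ [üÜ]gynök')
# Begins with rare literals, so most positions are rejected immediately.
NARROW_START = re.compile(r'(?:FBI|CIA|KGB) [üÜ]gynök')


def best_time(pattern, text=TEXT, repeat=5):
    """Return the fastest of `repeat` single searches, in seconds."""
    return min(timeit.repeat(lambda: pattern.search(text), number=1, repeat=repeat))


if __name__ == "__main__":
    for name, pattern in (("broad", BROAD_START), ("narrow", NARROW_START)):
        print(f"{name:6s} start: {best_time(pattern):.6f} s")
```

The same harness can be pointed at a real page text and a real fix pattern before a long unattended collection run.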

Page 31

Speed

Fast replacements take the titles from

• -search

• -cat & al

• -links

• -transcludes

• -file

etc.

Page 32

The two-pass model of replacement

1. Gathering candidates (texts that may need replacing) into a file: -save / -savenew. Relatively slow and automatic.

– Optionally uploading the list to your wiki

2. Making the actual replacements. Faster (or very fast) and attended.

Page 33

Decisions

1. Command line parameters or fix?

2. Searching in live wiki or in dump?

3. Search & replace in one run or separately?

4. Simple text replacements or regular expressions?

5. Manual or automatic running?

Page 34

Efficiency

Page 35

What does it mean?

• Find as many occurrences as possible (even if agglutinated)

• Find as few false positives as possible

• Face as few correction conflicts as possible

• Give the appropriate replacement always

• Let the bot work quickly — don’t wait in front of the screen

Page 36

Keys to efficiency

• If you find a very efficient replacement (near 100%), do it separately before the others in the same package – you will have fewer conflicts (but you may collect them together)

• Packages that are too big may run slowly and have a greater chance of causing correction conflicts. Sometimes it is worth splitting them into smaller parts.

• Packages that are too small will waste more dead time during preparation and execution. Sometimes it is worth merging them.

• How to decide then? Just watch.

Page 37

Keys to efficiency

• Use exceptions when appropriate. They will decrease false positives as well as correction conflicts. E.g.

– Cite book, cite web, cite anything templates

– URLs, image names (even as template parameters and gallery images!)

– Templates marking pages out of your scope (old authors in Hungarian Wikipedia whose quotations contain old-style spelling)

– Titles marking pages out of your scope (year numbers in law in Hungarian Wikipedia)

• …and first of all: improve your regexes continuously!

Page 38

Keys to efficiency

• Once you have found a false positive, save it for later use! -saveexc / -saveexcnew

• Then insert these titles into your exceptions.

• Run searches before/during creation of a fix.

• Don’t deal with tasks that are not worth a bot!

• Use the two-pass model and the dump whenever possible!

Page 39

An ugly example

I have a fix to correct short and long i (i/í).

Argentína is spelled with í in Hungarian, but Argentina often occurs in English and Spanish titles → no regex for it, title exceptions must be used → a separate fix.

But they may be collected together.

Page 40

A less ugly example

• replace.py ásnéven "ás néven" -search:másnéven -ns:0 -summary:"Helyesírás javítása kézi botszerkesztéssel: más néven"

live demo

Page 41

Character encoding problems

• Keep your files in UTF-8, and don’t use Windows Notepad

• E.g. setting in Notepad++:

Page 42

Character encoding problems

• If it doesn’t work on the command line, write a fix

• If you can’t solve it with a fix, use URL encoding

– replace.py -catr:Венгрия . @ -lang:ru -excepttext:"[[hu:" -save:magyarok.txt -always

– replace.py -catr:%D0%92%D0%B5%D0%BD%D0%B3%D1%80%D0%B8%D1%8F . @ -lang:ru -excepttext:"[[hu:" -save:magyarok.txt -always

live demo

• You may store this in a script (import replace.py)

This is the way of page collections
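The URL-encoded form of a category name does not have to be typed by hand. A small sketch using Python's standard `urllib.parse` module follows; the variable names are chosen for this example, and only the `-catr:` prefix is taken from the commands above.

```python
from urllib.parse import quote, unquote

category = "Венгрия"

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
encoded = quote(category)

# The argument as it would be passed on a command line that cannot
# handle Cyrillic input.
argument = "-catr:" + encoded

# unquote() reverses the encoding, which is useful for checking the result.
decoded = unquote(encoded)
```

Here `encoded` is the same percent-encoded string that appears in the second command of the slide.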

Page 43

Page collections

Page 44

The two-pass model of replacement

1. Gathering candidates (texts that may need replacing) into a file: -save / -savenew. Relatively slow and automatic.

– Optionally uploading the list to your wiki

2. Making the actual replacements. Faster (or very fast) and attended.

Page 45

A simple idea

1. Gathering candidates (texts that may need replacing) into a file. Relatively slow and automatic.

– Uploading the list to your wiki (this is the result!)

2. Nothing. You are done.

Page 46

Some ideas for page collections

• Scheme: some existing/missing text

• Articles related to Hungary in other Wikipedias (see above for ruwiki)

• The Redlist Project for animals and plants

• Articles with {{commons}} template, but without any image

• …let your imagination run free!
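The "existing/missing text" scheme can be sketched for the {{commons}} idea with two regular expressions: one text that must be present and one that must be absent. Everything here is an assumption made for illustration: the function names, the `(title, text)` input pairs, and the simplification that images appear only as `[[File:…]]` or `[[Image:…]]` links with English namespace names (galleries and infobox parameters are ignored).

```python
import re

# Present: a {{commons}} template, but not e.g. {{commonscat}}.
HAS_COMMONS = re.compile(r'\{\{\s*commons\s*[|}]', re.IGNORECASE)
# Absent: any [[File:...]] or [[Image:...]] link.
HAS_IMAGE = re.compile(r'\[\[\s*(?:File|Image)\s*:', re.IGNORECASE)


def is_candidate(wikitext):
    """True if the text has a {{commons}} template but no image link."""
    return bool(HAS_COMMONS.search(wikitext)) and not HAS_IMAGE.search(wikitext)


def collect(pages):
    """Return the titles of all candidate pages from (title, text) pairs."""
    return [title for title, text in pages if is_candidate(text)]
```

The resulting title list is the product itself: it only needs to be saved or uploaded, with no second pass.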

Page 47

Useful links

[[meta:User:Bináris]]

Thank you for your attention!

Page 48

PS – some thoughts months later

• Lookahead is faster than recursion or overlapping.

• If a function is called for each match, that makes the bot run really slowly.

• In such cases a separate "fellow fix" without a function call is useful for faster searching.