Meidogger:Winterbot/AutoFormatcode
(Taken from Robert Ullmann's famous bot on en wikt)
notice
This comes with several very important caveats:
- I am a professional software engineer; this is what I do. However, this code was written for my own use, is not warranted, and carries no implication of merchantability or fitness for use.
- Like everything else in the Wiktionary, this is under the GFDL. The GFDL is not compatible with the GPL; this document is not licensed under the GPL as software. (!)
- At any given moment, this code may not represent what is being run; I have no intention of updating this page every time I make a change.
technical notes
I don't have my ego attached to code I write; I routinely dump code that has gotten too complex, and re-write it. On the other hand, even if something is sloppy, if it is tested and works, I leave it alone.
- Some of the comments may be snarky.
- The comments are often (usually) written to remind me of something, not to explicate the code.
- Since I modify this regularly, there is code that is not reached or otherwise redundant.
- The pre-parsing should go deeper; a fairly major restructuring would be helpful at some point soon.
- There are a small number of known (to me ;-) bugs that I handle by monitoring the edits done, having not yet fixed them. (Like handling multi-line comments.)
- The wikipedia.py module AF uses is heavily modified from the distro; however the interface is the same. In the presence of network problems/failures/outages AF may abort when the modified version would have recovered. The exceptions thrown are the same, but under differing conditions.
- On Linux, the clock timing works, but will display ugly large values.
- The code to handle headers is largely hacked to implement the "Connel" flag ....
- The code that handles Etymology headers is based on the current WT:ELE; there is no problem changing it when we figure out how Etymology and Pronunciation are supposed to play nicely together in the general case.
- The bot must also have a sysop account, to read patrolled flags in RC; "enhanced" RC mode must be turned off.
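Because the patrolled flags are read by scraping the classic Recent Changes HTML (which is why "enhanced" RC must be off), the extraction boils down to a regex over `<li>` lines. A minimal sketch of that technique, using the same style of pattern as the bot; the sample HTML lines here are invented for illustration:

```python
import re

# Same trick the bot's pattern uses: the link text must repeat the title attribute.
rcex = re.compile(r'<li>.*?<a href=".+?" title="(?P<title>.+?)">(?P=title)</a>')

sample = [
    '<li><a href="/wiki/foo" title="foo">foo</a> edited</li>',
    '<li><a href="/wiki/bar" title="bar">bar</a> "unpatrolled"</li>',
]

titles = []
for line in sample:
    if line.find('"unpatrolled"') > 0:
        continue  # skip edits not yet marked patrolled
    mo = rcex.match(line)
    if mo:
        titles.append(mo.group('title'))

print(titles)  # ['foo']
```

Scraping rendered HTML like this is fragile by design; it only works as long as the skin markup stays stable, which is one reason enhanced RC (which groups edits differently) breaks it.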
outline
- prescreen
Reads the XML dump, uses simple regex to find entries that may need attention, and builds a random index
- rcpages
Generator called by the main routine. Calls prescreen, then cycles through reading Recent Changes, looking at the request category, and yielding pages found.
- main
Reads configuration pages, builds tables to be used. Loops on rcpages generator, for each entry:
- runs regex on the entire text
- breaks entry into language sections, plus prolog (above first section), and iwikis
- in each language section:
- looks for and fixes Etymology headers
- herds cats
- fixes bad headers
- fixes linking in trans tables
- fixes top to trans-top
- subst's (replaces) language code template
- etc
- then reassembles the entry, removing multiple blank lines, adding ---- rules, and so on
- checks the actions performed
- if any resulting action, rewrites the page
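The flow above — a prescreen that builds a random index, a generator that feeds pages to the main loop — can be sketched as follows. This is an illustrative skeleton, not the bot's actual code; the names and the single screening test stand in for the real battery of regexes:

```python
import random

def prescreen(entries):
    # build a random-keyed index of candidate titles, as the bot does with PSK
    psk = {}
    for title, text in entries:
        if '{{top|' in text:  # stand-in for the bot's many screening regexes
            psk[random.random()] = (title, 'top template w/param')
    return psk

def rcpages(entries):
    # generator consumed by main: prescreened hits first; the real bot then
    # cycles forever over the request category and Recent Changes
    psk = prescreen(entries)
    for key in sorted(psk):  # sorting the random keys yields a shuffled order
        yield psk[key]

def main(entries):
    fixed = []
    for title, reason in rcpages(entries):
        # ... split into language sections, fix headers, reassemble, save ...
        fixed.append(title)
    return fixed

print(main([('foo', '{{top|gloss}}'), ('bar', 'plain entry')]))  # ['foo']
```

Keying the index on `random.random()` and iterating `sorted(psk)` is a cheap way to visit candidates in shuffled order without holding a separate shuffle pass.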
code
#!/usr/bin/python
# -*- coding: utf-8 -*-
############################################################
# #
# NOTE: utf-8 doesn't work for me on fy wikt; Latin1 does. #
# #
############################################################
"""
This bot looks for entries tagged for autoformatting and performs a number of cleanup tasks.
No command line arguments.
"""
import wikipedia
import catlib
import sys
import re
import pickle
import time
import xmlreader
import socket
def safe(s):
    return pickle.dumps(s)[1:-5]

def lkey(l):
    n = l.strip('[]')
    if n == 'Translingual': return '0' + n
    if n == 'English': return '1' + n
    # bad L2 headers
    if n.lower() == 'cyrillic alphabet': return '0' + n
    if n.lower() == 'arabic alphabet': return '0' + n
    if n.lower() == 'see also': return '3' + n
    if n.lower() == 'references': return '4' + n
    return '2' + n

PSK = { }

from random import random

AH = set()
#newpages = set()
def prescreen():
    # get XML dump
    dump = xmlreader.XmlDump("../hancheck/en-wikt.xml")
    srx = { }
    srx['lcode header'] = re.compile(r'^== *\{\{.{2,3}\}\} *==', re.M)
    srx['lcode trans'] = re.compile(r'^\* *\{\{.{2,3}\}\} *:', re.M)
    srx['top template w/param'] = re.compile(r'\{\{top\|')
    srx['top template w/semi gloss'] = re.compile(r'^;.*\n\{\{top\}', re.M)
    srx['top template w/qbold gloss'] = re.compile(r"'''\n\{\{top\}")
    srx['gender'] = re.compile(r"^\*.*:.*''[fmcn]''", re.M)
    # srx['wikipediapar'] = re.compile(r"\{\{[Ww]ikipediapar")
    srx['Wikipedia'] = re.compile(r"\{\{Wikipedia")
    srx['Unicode'] = re.compile(r"\{\{Unicode")
    srx['Acronym'] = re.compile(r"\{\{Acronym")
    srx['Initialism'] = re.compile(r"\{\{Initialism")
    srx['Abbreviation'] = re.compile(r"\{\{Abbreviation")
    srx['cattag'] = re.compile(r"\{\{cattag")
    srx['trad'] = re.compile(r"\{\{trad-?\|")
    srx['rest after header'] = re.compile('^=+[^=\n]+=+[^=\n]+$', re.M)
    srx['Pronounciation'] = re.compile('Pronounciation')
    srx['categorized'] = re.compile('[Cc]ategori[sz]ed')
    # srx['transitive verb'] = re.compile('=== ?(Int|T)ransitive verb', re.I)
    srx['etymology with parens'] = re.compile('Etymology ?\(')
    srx['etymology at L4'] = re.compile('^==== ?Etymology', re.M)
    srx['also see'] = re.compile('= ?Also see')
    srx['indented see'] = re.compile(r'^:\{\{see\|', re.M)
    srx['indented Cyrillic'] = re.compile(r'^:Cyrillic', re.M)
    srx['indented Roman'] = re.compile(r'^:Roman', re.M)
    # next really needs something re lang sects, try it for now, sorta works
    # srx['maybe un-numbered ety'] = re.compile(r'^=== ?Etymology ?===.*Etymology', re.M|re.S)
    srx['PAGENAME'] = re.compile('\{PAGENAME')
    srx['-----'] = re.compile('-----')
    # header case problems ...
    srx['lc header'] = re.compile(r'^={1,6} ?[a-z][-a-zA-Z ]*=+$', re.M)
    srx['non sc header'] = re.compile(r'^={3,6} ?[A-Z][-a-z ]*[A-Z][-a-zA-Z ]*=+$', re.M)
    # contexts
    srx['context tag'] = re.compile(r"^# *\(''.+?''\)", re.M)
    srx['context tag 2'] = re.compile(r"^# *''\(.+?\)''", re.M)
    srx['context italbrac'] = re.compile(r"^# *\{\{italbrac", re.M)
    # re-work rfc level tags (not for now! don't go looking for them)
    # srx['rfc level'] = re.compile(r'^\{\{rfc-level.*\+',re.M)
    # srx['X phrase'] = re.compile(r'^={3,5} *[-a-zA-Z ]* phrase *=+$', re.M)
    # and so on
    srx['chu nom crap'] = re.compile(r'^===\s*chu nom\s*===', re.M|re.I)
    reah = re.compile(r'^={3,6} *([-a-zA-Z ]+) *=+$', re.M)
    counts = { 'at random':0, 'unknown header':0 }
    for r in srx: counts[r] = 0
    entries = 0
    tags = 0
    piscine = set()
    # skip a few others besides the level 3-6 headers
    AH.add('Mandarin')
    AH.add('Cantonese')
    AH.add('Min Nan')
    for entry in dump.parse():
        text = entry.text
        title = entry.title
        if title.find(':') >= 0: continue
        if text and text[0:1] == '#': continue
        entries += 1
        if entries % 10000 == 0: print "prescreen: %d entries, %d tagged" % (entries, tags)
        # screen entries:
        tag = False
        for reason in srx:
            if srx[reason].search(text):
                # most of context, but not all ...
                if reason.startswith('context') and random() > 0.7: continue
                tag = True
                break
        if not tag and random() < 0.0035:
            reason = 'at random'
            tag = True
        if not tag and text.find('{{rfc') < 0:
            for mo in reah.finditer(text):
                h = mo.group(1).strip()
                if h not in AH:
                    if h not in piscine:
                        print "header %s tagged" % safe(h)
                        piscine.add(h)
                    reason = "unknown header"
                    tag = True
        if tag:
            counts[reason] += 1
            # print 'prescreen: %s selected %s' % (safe(title), reason)
            PSK[random()] = (title, reason)
            tags += 1
    for r in sorted(counts):
        print 'count for %s is %d' % (r, counts[r])
def now(): return int(time.clock())

def rcpages(site):
    # generator which yields recentchanges, but not unpatrolled changes
    # also entries in category
    # in between, yields pages that satisfy the prescreen in random order
    prescreen()
    site = wikipedia.getSite("en", "wiktionary")
    cat = catlib.Category(site, "Category:Requests for autoformat")
    seen = set()
    nextcat = now() - 1
    nextrc = now() - 1
    hold = { }
    rcex = re.compile(r'<li>.*?<a href=".+?" title="(?P<title>.+?)">(?P=title)</a>')
    for r in sorted(PSK):
        # sorted!
        title, reason = PSK[r]
        seen.add(title)
        print '(%d, from prescreen %s)' % (now(), reason)
        page = wikipedia.Page(site, title)
        yield page
    # loop forever, polling the category and recent changes
    # (the 'continue' statements below require this enclosing loop)
    while True:
        nf = 0
        nd = 0
        # get our category, every 10-15 minutes or so
        if now() > nextcat:
            nextcat = now() + 700
            cat.catlist(purge = True)
            for page in cat.articles():
                nf += 1
                # time.sleep(3)
                print '(%d)' % now()
                seen.add(page.title())
                if page.title() in hold: del hold[page.title()]
                yield page
        # recent changes
        if now() > nextrc:
            print '(%d, reading recent changes)' % now()
            try:
                rct = site.getUrl("/w/index.php?title=Special:Recentchanges&limit=500", sysop = True)
            except wikipedia.NoPage:
                print "Can't get recentchanges from en.wikt!"
                rct = ''
                time.sleep(30)
                continue
            nextrc = now() + 610
            ht = 480
            for line in rct.splitlines():
                # if line.startswith('<li>'): print safe(line)
                if line.find('Deletion log') > 0: continue
                mo = rcex.match(line)
                if mo:
                    title = mo.group('title')
                    if title.find(':') >= 0: continue
                    if title.find('/') >= 0: continue
                    #if line.find('<span class="newpage">N</span>') > 0: newpages.add(title)
                    if line.find('"unpatrolled"') > 0: continue
                    if line.find('User talk:AutoFormat') > 0: seen.add(title)
                    if title not in seen:
                        seen.add(title)
                        hold[title] = now() + ht
                        # scatter out into future ...
                        ht += 34
                        nf += 1
                        print "found: [%s] hold until %d" % (safe(title), hold[title])
        pastime = now()
        for title in sorted(hold):
            # 10 on a pass is enough
            if nd > 9: break
            if hold[title] > pastime: continue
            del hold[title]
            nd += 1
            # time.sleep(3)
            print '(%d)' % now()
            page = wikipedia.Page(site, title)
            yield page
        if nd == 0:
            print "(%d, sleeping)" % now()
            # usually 10 to 20 seconds, but rely on put throttle
            time.sleep(70)
        print '(%d, %d entries held)' % (now(), len(hold))
        continue
# now have some serious recursion fun!
# fuzzy returns string match score
def fuzzy(a, b):
    if not a or not b: return 0
    if a[0:1] == b[0:1]: return 1 + fuzzy(a[1:], b[1:])
    # try skip this char
    sk = fuzzy(a[1:], b[1:])
    # try with each char forward
    p = a.find(b[0:1])
    if p >= 0: sca = 1 + fuzzy(a[p+1:], b[1:])
    else: sca = 0
    p = b.find(a[0:1])
    if p >= 0: scb = 1 + fuzzy(b[p+1:], a[1:])
    else: scb = 0
    return max(sk, sca, scb)
def infline(title, lang, header):
    pos = header.lower()
    if pos.startswith('{{'):
        pos = pos[2:-2].split('|')[0]
    if lang == 'en':
        if pos in ['verb', 'noun', 'adjective', 'adverb']:
            return "{{infl|en|" + pos + "}}[[Category:English "+ pos +"s that lack inflection template]]"
    a = ord(title[0:1])
    # Arabic:
    if (a >= 0x0600 and a < 0x0780):
        return "{{infl|%s|%s|sc=Arab}}" % (lang, pos)
    # Han:
    # this is planes 1-2, needs closer check
    if (a >= 0x3400 and a < 0xA000) or (a >= 0xd800 and a < 0xdc00):
        if lang == 'ko':
            return "{{infl|%s|%s|sc=Hant}}{{ko-attention|may need inflection template}}" % (lang, pos)
        elif lang == 'ja':
            return "{{infl|%s|%s|sc=Jpan}}{{ja-attention|needs inflection template}}" % (lang, pos)
        elif lang == 'vi':
            return "{{infl|%s|%s|sc=Hant}}{{vi-attention|may need inflection template}}" % (lang, pos)
        else:
            return "{{infl|%s|%s|sc=Hani}}{{zh-attention|needs inflection template}}" % (lang, pos)
    if lang == 'ja':
        return "{{infl|%s|%s}}{{ja-attention|needs inflection template}}" % (lang, pos)
    if lang == 'ko':
        return "{{infl|%s|%s}}{{ko-attention|may need inflection template}}" % (lang, pos)
    if lang in ['zh', 'cmn', 'yue', 'nan']:
        return "{{infl|%s|%s}}{{zh-attention|may need inflection template}}" % (lang, pos)
    return "{{infl|%s|%s}}" % (lang, pos)
MOD = [ 'chiefly', 'coarse', 'especially', 'extremely', 'frequently', 'generally', 'mainly', 'markedly',
        'mildly', 'mostly', 'often', 'particularly', 'primarily', 'sometimes', 'usually', 'very' ]

reunlink = re.compile(r'\[\[(.*?)\]\]')

PRETULIP = ('of ', 'by ')

def cpar(cstr, ctxs):
    # convert context string to template name(s)
    tname = ''
    cstr = re.sub(r'[,;\|]+', ',', cstr)
    for cs in cstr.split(','):
        cs = cs.strip(" '")
        if cs.find('[') >= 0: cs = reunlink.sub(r'\1', cs)
        # handles n modifiers, does context? yes.
        while cs.split(' ')[0].lower() in MOD:
            mod = cs.split(' ')[0].lower()
            tname += mod + '|'
            cs = cs[len(mod):].strip()
        if cs.lower() in ctxs:
            tname += ctxs[cs.lower()] + '|'
        elif cs.startswith(PRETULIP):
            tname += cs + '|'
        else: return ''
    tname = tname.rstrip('|')
    if tname.startswith(PRETULIP): tname = 'context|' + tname
    return tname

def ibsub(imo):
    # some prefix captured
    pref = imo.group(1)
    istr = imo.group(2)
    s = reunlink.sub(r'\1', istr)
    # not general enough, bar pipes in match for now in re precomp
    #if s != istr and s.find('|') >= 0: s = s.split('|')[1]
    s = re.sub(r',\s*', '|', s)
    if imo.group(3) == ':':
        return pref + '{{i-c|' + s + '}}'
    else:
        return pref + '{{i|' + s + '}}'

def sdif(a, b):
    # returns -(a stuff) +(b stuff) when one change
    i = 0
    while a[i:i+1] and a[i:i+1] == b[i:i+1]: i += 1
    an = a[i:]
    bn = b[i:]
    i = 1
    while i < len(an) and an[-i:] == bn[-i:]: i += 1
    i -= 1
    if i: return '-' + an[:-i] + ' +' + bn[:-i]
    else: return '-' + an + ' +' + bn

# okay, try that! not so pretty is it?
def main():
    socket.setdefaulttimeout(30)
    # regex precomp
    reL2head = re.compile(r'==?\s*([^=]+)={1,6}(.*)')
    reheader = re.compile(r'(={3,6})\s*([^=]+)={1,6}(.*)')
    reiwiki = re.compile(r'\[\[[-a-z]{2,11}:(.*)\]\]')
    recat = re.compile(r'\[\[category:.*?\]\]', re.I)
    retrans1 = re.compile(r'\*\s*\[\[w:.+\|([^\]]+?)\]\]\s*:(.*)')
    retrans2 = re.compile(r'\*\s*\[\[([^\]]+?)\]\]\s*:(.*)')
    retrans3 = re.compile(r'\*\s*([^:]+?):(.*)')
    retag = re.compile(r'\{\{rfc-auto(\|.*?|)}}')
    regender = re.compile(r"''([mfcn])''")
    reglossfix = re.compile(r'(.+)\(\d+\)$')
    retopgloss = re.compile(r'\{\{top(\|.*?|)}}$')
    recontext = re.compile(r"^# *\(''(.+?)''\):? ?(.*)$", re.M)
    recontext2 = re.compile(r"^# *''\((.+?)\):?'' ?(.*)$", re.M)
    recontext3 = re.compile(r"^# *\{\{italbrac\|([^}]+?)}}:? ?(.*)$", re.M)
    # be careful to match and remove newline in these:
    rerfclevel = re.compile(r"^\{\{rfc-level\|.*\+.*\}\}\n", re.M)
    rerfcxphrase = re.compile(r"^\{\{rfc-xphrase\|.*\}\}\n", re.M)
    rerfcheader = re.compile(r"^\{\{rfc-header\|.*\}\}\n", re.M)
    # italbracs not on context/defn lines, template italbrac->i replacement separate
    # limited forms ... nowilink with pipes, no templates, look for : in mo.g3
    # look for gloss, etc, * lines to start ...
    reibcomma = re.compile(r"^(\*\s*)\(''([^\)^'^\|^\{]+):?''\)(:?)")
    reibcomma2 = re.compile(r"^(\*\s*)''\(([^\)^'^\|^\{]+):?\)''(:?)")
    # regex table (dict, name = tuple of compiled object and replacement)
    Regex = { }
    Regex['subst:PAGENAME'] = (re.compile(r'\{\{PAGENAME}}'), '{{subst:PAGENAME}}')
    Regex['template -cattag +context'] = (re.compile(r'\{\{cattag\|'), '{{context|',)
    Regex['template -Unicode +unicode'] = (re.compile(r'\{\{Unicode\|'), '{{unicode|',)
    #Regex['template -wikipediapar +wikipedia'] = (re.compile(r'\{\{[Ww]ikipediapar([\|\}])'), r'{{wikipedia\1')
    Regex['template -Wikipedia +wikipedia'] = (re.compile(r'\{\{Wikipedia([\|\}])'), r'{{wikipedia\1')
    Regex['template -Acronym +acronym'] = (re.compile(r'\{\{Acronym([\|\}])'), r'{{acronym\1')
    Regex['template -Initialism +initialism'] = (re.compile(r'\{\{Initialism([\|\}])'), r'{{initialism\1')
    Regex['template -Abbreviation +abbreviation'] = (re.compile(r'\{\{Abbreviation([\|\}])'), r'{{abbreviation\1')
    Regex['template -trad +t'] = (re.compile(r'\{\{trad\|'), '{{t|',)
    Regex['template -trad- +t-'] = (re.compile(r'\{\{trad-\|'), '{{t-|',)
    Regex['un-indent {{see}} template'] = (re.compile(r'^:\{\{see\|', re.M), '{{see|')
    # make sure we are logged in
    site = wikipedia.getSite("en", "wiktionary")
    site.forceLogin(sysop = True)
    site.forceLogin(sysop = False)
    # get our config pages, throw exceptions: we have to stop if we can't read these
    # use sysop to bypass "page is locked?" exception
    print "read languages"
    page = wikipedia.Page(site, "User:AutoFormat/Languages")
    langtab = page.get(sysop = True)
    print "read headers"
    page = wikipedia.Page(site, "User:AutoFormat/Headers")
    headtab = page.get(sysop = True)
    print "read Top40"
    page = wikipedia.Page(site, "Wiktionary:Translations/Wikification")
    top40tab = page.get(sysop = True)
    print "read contexts"
    page = wikipedia.Page(site, "User:AutoFormat/Contexts")
    ctxtab = page.get(sysop = True)
    Lcodes = { }
    Ltocode = { }
    relangtab = re.compile(r'\| (.*?)\|\|(.*)')
    i = 0
    for line in langtab.splitlines():
        mo = relangtab.match(line)
        if mo:
            for code in mo.group(1).split(','):
                Lcodes[code.strip()] = mo.group(2).strip()
                i += 1
            Ltocode[mo.group(2).strip()] = mo.group(1).split(',')[0].strip()
    print "found %d language codes" % i
    # treat a couple of other codes as Mandarin etc, since they are in cats:
    Lcodes['zh-cn'] = 'Mandarin'
    Lcodes['zh-tw'] = 'Mandarin'
    Lcodes['nan-cn'] = 'Min Nan'
    Lcodes['nan-tw'] = 'Min Nan'
    Lcodes['yue-cn'] = 'Cantonese'
    Lcodes['yue-hk'] = 'Cantonese'
    Level = { }
    L43 = { }
    POS = { }
    EOS = [ 'See also', 'References', 'External links', 'Anagrams']
    TOS = [ 'Pronunciation', 'Alternative spellings', 'Alternative forms']
    HAN = ['Han character', 'Kanji', 'Hanzi', 'Hanza']
    HT = ( '{{abbreviation', '{{initialism', '{{acronym', '{{numeral' )
    NS = { }
    Hfix = { }
    reheadtab = re.compile(r'\| (.*?)\|\|\s*([1-5/]*)\s*\|\|(.*?)\|\|(.*?)\|\|(.*)')
    i = 0
    for line in headtab.splitlines():
        mo = reheadtab.match(line)
        if mo:
            header = mo.group(1).strip()
            if mo.group(2).strip() == '4/3':
                L43[header] = True
                Level[header] = 4
                print "header %s is 4/3" % header
            else: Level[header] = int(mo.group(2))
            if mo.group(3).strip() == 'NS': ns = NS[header] = True
            else: ns = False
            if mo.group(4).strip() == 'POS': POS[header] = True
            for variant in mo.group(5).split(','):
                variant = variant.lower().strip()
                if not variant: continue
                Hfix[variant] = header
                if not ns:
                    if variant[-1:] == 's': Hfix[variant[:-1]] = header
                    else: Hfix[variant + 's'] = header
            Hfix[header.lower()] = header
            if not ns:
                if header[-1:] == 's': Hfix[header.lower()[:-1]] = header
                else: Hfix[header.lower() + 's'] = header
            i += 1
    print "found %d headers" % i
    # lots of possible ety sects, 1 to 24
    for i in range(1, 25):
        Hfix['etymology %d'%i] = 'Etymology %d'%i
        Level['Etymology %d'%i] = 3
    Top40 = { }
    Classics = { }
    retop40tab = re.compile(r'\*\s*(.*)')
    i = j = 0
    inT40 = True
    for line in top40tab.splitlines():
        if line.startswith('----'): inT40 = False
        mo = retop40tab.match(line)
        if mo:
            lang = mo.group(1).strip(' []')
        else: continue
        if inT40:
            Top40[lang] = True
            i += 1
        else:
            if lang in Top40:
                print "language %s in both Top40 and Classics?" % safe(lang)
                continue
            Classics[lang] = True
            j += 1
    print "found %d Top 40 languages" % i
    print "found %d Classic languages" % j
    # add all other known languages not in Top40:
    i = 0
    for code in Lcodes:
        lang = Lcodes[code]
        if lang not in Top40 and lang not in Classics:
            if lang == 'English': continue
            Classics[lang] = True
            i += 1
            print "added Classic: %s" % safe(lang)
    print "added %d languages to Classics" % i
    Contexts = { }
    rectxtab = re.compile(r"\|\s*''(.*?)''\s*\|\|(.*)")
    i = 0
    for line in ctxtab.splitlines():
        mo = rectxtab.match(line)
        if mo:
            m1 = mo.group(1).strip()
            m2 = mo.group(2).strip()
            if not m1 or not m2: continue
            # only use first, table at top over-rides auto, templates over-ride redirects
            if m1 not in Contexts: Contexts[m1] = m2
            i += 1
    print "found %d context templates" % i
    # turn on/off for now
    contextp = True
    entries = 0
    fixed = 0
    # (specific stats)
    # Set up set of all headers that are valid (at L3 or higher)
    for header in Level:
        AH.add(header)
    # Sigh. True means prohibited from changing 4/3 levels
    Connel = True
    for page in rcpages(site):
        title = page.title()
        print "page %s" % safe(title)
        if title.find(':') >= 0:
            print "not in main namespace"
            continue
        if title.find('/') >= 0:
            print "apparent subpage"
            continue
        entries += 1
        try:
            text = page.get()
        except wikipedia.NoPage:
            print "Can't get %s from en.wikt" % safe(title)
            text = ''
            continue
        except wikipedia.IsRedirectPage:
            print "Redirect page %s" % safe(title)
            text = ''
            continue
        acts = set()
        mo = retag.search(text)
        if mo:
            if mo.group(1).strip(' |'):
                acts.add('rm tag:' + mo.group(1).strip(' |'))
            else:
                acts.add('rm tag')
            text = retag.sub('', text)
        # rfc level trickery
        newtext = rerfclevel.sub('', text)
        if newtext != text:
            print 'took out rfc-level'
            acts.add('rm rfc-level tag')
            text = newtext
        # same for xphrase
        newtext = rerfcxphrase.sub('', text)
        if newtext != text:
            print 'took out rfc-xphrase'
            acts.add('rm rfc-xphrase tag')
            text = newtext
        # same for header
        newtext = rerfcheader.sub('', text)
        if newtext != text:
            print 'took out rfc-header'
            acts.add('rm rfc-header tag')
            text = newtext
        if text.find('{{rfc') >= 0: rfc = True
        #elif text.find('{{rfc|') >= 0: rfc = True
        #elif text.find('{{rfc-') >= 0: rfc = True
        else: rfc = False
        rfcact = ''
        # overall regex, using table
        for rx in Regex:
            newtext = Regex[rx][0].sub(Regex[rx][1], text)
            if newtext != text:
                acts.add(rx)
                text = newtext
        # categories found in the entry or implied by context and perhaps inflection templates
        catseen = set()
        # now parse. take the entry apart into languages (ha!)
        curr = '*prolog'
        last = ''
        Lsect = { '*prolog':[ ], '*iwiki':[ ] }
        Lcats = { }
        waslinked = [ ]
        divs = 0
        for line in text.splitlines():
            # L2 headers
            mo = reL2head.match(line)
            if mo:
                header = mo.group(1).strip()
                hf = reunlink.sub(r'\1', header)
                if hf != header:
                    if hf.find('|') >= 0: hf = hf.split('|')[1]
                    if hf not in Top40: waslinked.append(hf)
                    else: acts.add('unlink language header ' + hf)
                    header = hf
                # validate language
                if header in Level:
                    if not rfc:
                        text = '{{rfc|' + header + ' as level 2 header}}\n' + text
                        rfcact = 'add rfc tag for L1/2 header ' + header
                        rfc = True
                    else:
                        print "(no edit, bad L2 header and rfc)"
                        rfcact = 'bad L1/2 header ' + header
                # subst code template
                if header[0:2] == '{{':
                    if header[2:-2] in Lcodes:
                        hf = Lcodes[header[2:-2]]
                        acts.add('L2 header -' + header + ' +' + hf)
                        header = hf
                # check sort order
                if header and last and lkey(header) < lkey(last):
                    acts.add(last + '/' + header + ' sorted into order')
                last = header
                if header not in Lsect:
                    Lsect[header] = [ ]
                    Lcats[header] = [ ]
                else:
                    acts.add('merged ' + header + ' sections')
                curr = header
                if mo.group(2).strip():
                    acts.add('stuff after L2 header moved')
                    Lsect[curr].append(mo.group(2).strip())
                continue
            # look for iwiki
            mo = reiwiki.match(line)
            if mo and mo.group(1) == title:
                Lsect['*iwiki'].append(line)
                continue
            # trailing spaces
            if len(line) > 2 and line[0] == '=' and line[-1:] == ' ': acts.add('rm spaces after header')
            line = line.strip()
            # take out dividers
            if line.startswith('----'):
                if line == '----': divs += 1
                continue
            # other lines
            Lsect[curr].append(line)
        # any language sections?
        if len(Lsect) == 2:
            # no, tag if not tagged
            if text.find('nolanguage/box') < 0 and text.find('{{wikify') < 0:
                Lsect['*prolog'][0:0] = ['{{subst:nolanguage}}']
                acts.add('tagged nolanguage')
        # each section
        for lang in Lsect:
            if lang[0] == '*': continue
            # find Etymologies first
            etys = [ ]
            etycount = 0
            fh = True
            for i in range(len(Lsect[lang])):
                line = Lsect[lang][i]
                # look for ety headers, and Pronunciation first at L4
                mo = reheader.match(line)
                if mo:
                    level = len(mo.group(1))
                    header = mo.group(2).strip()
                    # rest = mo.group(3)
                    # special case pronunciation, occurs with some frequency
                    if fh and level != 3 and fuzzy(header.lower(), 'pronunciation') >= 11 and len(header) < 15:
                        acts.add('Pronunciation changed to level 3')
                        Lsect[lang][i] = '===' + header + '==='
                        # and leave fh set:
                        continue
                    # just do fuzzy!
                    if fuzzy(header.lower(), 'etymology') >= 7 and len(header) < 20:
                        if level != 3:
                            if fh:
                                # first header, okay to fix!
                                acts.add('Etymology changed to level 3')
                                # and leave fh set:
                                etycount += 1
                                etys.append(i)
                                continue
                            elif not rfc:
                                Lsect[lang][i] = line + '{{rfc-level|Etymology not at level 3}}'
                                acts.add('+{{rfc-level|Etymology not at level 3}}')
                                rfc = True
                                continue
                            else:
                                print "(ety not at L3 and already rfc)"
                                continue
                        etycount += 1
                        etys.append(i)
                    fh = False
            # then fix/rewrite the ety headers, use sub to handle rest, report any changes (spacing an issue):
            if etycount:
                for i in range(etycount):
                    line = Lsect[lang][etys[i]]
                    # print 'ety check replace ' + line
                    if etycount > 1: newline = reheader.sub(r'===Etymology %d===\3' % (i+1), line)
                    else: newline = reheader.sub(r'===Etymology===\3', line)
                    if newline.strip('= ') != line.strip('= '):
                        acts.add('header -' + line.strip('= ') + ' +' + newline.strip('= '))
                    Lsect[lang][etys[i]] = newline
            # sigh, think that's it? Sweet, if true...
# sigh, think that's it? Sweet, if true...
# general format
newlines = [ ]
inPos = inTrans = inPro = inext = defnext = False
npos = 0
ety = nety = 0
levelact = ''
rfctag = ''
for line in Lsect[lang]:
# move cats, may be something else on the line too, or multicats ...
# first we need a cat-present predicate
catp = False
for cat in recat.findall(line):
catp = True
catname = cat[11:-2].split('|')[0]
catname = re.sub('_', ' ', catname).strip()
cf = cat.find('|')
if cf > 0: cat = '[[Category:' + catname + cat[cf:]
else: cat = '[[Category:' + catname + ']]'
# we have a canonical cat! is it a novel cat?
if cat in catseen:
acts.add('rm dup cat [[:' + cat[2:])
continue
catseen.add(cat)
# see if it belongs in a different sect
catmove = False
if catname.find(':') >= 0:
catcode = catname.split(':')[0]
if catcode in Lcodes:
catlang = Lcodes[catcode]
if catlang != lang and catlang in Lcats:
acts.add('category ' + catname + ' moved to ' + catlang + ' section')
Lcats[catlang].append(cat)
catmove = True
elif not catname.lstrip(' 01').startswith(lang) and not catname.endswith('derivations'):
for other in Lcats:
if other == lang: continue
if catname.lstrip(' 01').startswith(other+' '):
acts.add('category ' + catname + ' moved to ' + other + ' section')
Lcats[other].append(cat)
catmove = True
break
# not moved
if not catmove: Lcats[lang].append(cat)
if catp:
line = recat.sub('', line).strip()
if not line: continue
                # headers
                mo = reheader.match(line)
                if mo:
                    # hit header with no defn line in previous section?
                    if defnext and header not in HAN:
                        newlines.append('# {{defn|%s}}' % lang)
                        acts.add('no definition line for %s/%s added {defn}' % (lang, header))
                    level = len(mo.group(1))
                    header = mo.group(2).strip()
                    rest = mo.group(3)
                    # unlink header
                    hf = reunlink.sub(r'\1', header)
                    if hf != header:
                        if hf.find('|') > 0: hf = hf.split('|')[1]
                        acts.add('header -' + header + ' +' + hf)
                        header = hf
                    # fix header
                    if header.lower() in Hfix:
                        hf = Hfix[header.lower()]
                        if hf != header:
                            acts.add('header -' + header + ' +' + hf)
                            header = hf
                    # try a fuzzy!
                    if header.lower() not in Hfix and header[0:2] != '{{':
                        high = 0
                        replac = ''
                        hf = header.strip('[]{}').lower()
                        for val in sorted(Hfix):
                            # first character must match
                            if hf[0:1] != val[0:1]: continue
                            rawsc = fuzzy(hf, val)
                            if rawsc > high and rawsc > max(max(len(hf), len(val)) - 3, 5):
                                high = rawsc
                                replac = val
                                print 'fuzzy for %s: %s score %d' % (safe(hf), replac, high)
                        if high:
                            hf = Hfix[replac]
                            acts.add('header -' + header + ' +' + hf)
                            header = hf
                    # tag Transitive and Intransitive verb
                    if header.lower() in ('transitive verb', 'intransitive verb') and not rfc:
                        rfctag = '{{rfc-trverb|' + header + '}}'
                        rfc = True
                        # print "trans/intrans header: %s" % safe(header)
                    # tag X phrase
                    if header.endswith(' phrase') and not rfc:
                        rfctag = '{{rfc-xphrase|' + header + '}}'
                        rfc = True
                        # print "X phrase header: %s" % safe(header)
                    # rfc unrecognized, ignore templates for now, use NS later
                    if header.lower() not in Hfix and not rfc and header[0:2] != '{{':
                        rfctag = '{{rfc-header|' + header + '}}'
                        rfc = True
                        # print "unknown header: %s" % safe(header)
                    # min level, set and comp for nested ety
                    if level == 3 and header.startswith("Etymology") and etycount > 1:
                        ety = 1
                        nety += 1
                        npos = 0
                        push = False
                    else:
                        if ety:
                            # if we are in the last ety sect, and see end of section things at L3:
                            if level < 4 and nety == etycount and header in EOS: inPos = ety = 0
                            # and ... independent of connel flag, because we always push ;-)
                            if level < 4 and nety == etycount and header in L43: inPos = ety = 0
                    # push POS (or level 3?) sections down in ety, push flag because of Connel fix
                    # may be a good idea anyway ... yes, but if we rfc, stop
                    if ety and not rfc:
                        if ((header in POS and header not in HAN) or header in TOS) and level == 3:
                            level = 4
                            acts.add('header in ety sect ' + header + ' to L' + str(level))
                            if header == 'Pronunciation':
                                rfctag = '{{rfc-level|check placement of Pronunciation}}'
                            push = True
                        elif ((header in POS and header not in HAN) or header in TOS):
                            # at correct level! (or too deep already)
                            push = False
                        elif push and header in Level and (level == 4 or level < Level[header] + ety):
                            level += 1
                            acts.add('header in ety sect ' + header + ' to L' + str(level))
                        elif level < 4: push = False
                    # code to shift header levels (general case in POS), disabled per Connel, 18.4.7
                    if inPos and header in L43:
                        if npos < 2 and level < 4 + ety:
                            if not Connel:
                                level = 4 + ety
                                acts.add('header ' + header + ' to L' + str(level))
                            else: levelact = ' (AutoFormat would have corrected level of ' + header +')'
                    elif inPos and header in Level:
                        if level < Level[header] + ety:
                            if not Connel:
                                level = Level[header] + ety
                                acts.add('header ' + header + ' to L' + str(level))
                            else: levelact = ' (AutoFormat would have corrected level of ' + header +')'
                    # now tag remaining problems if any, various cases
                    # should all contain "+" for the re-visit trick ...
                    if not rfc and level == 4 + ety and not inPos and header in Level and header not in NS:
                        rfctag = '{{rfc-level|' + header + ' at L4+ not in L3 POS section' + levelact + '}}'
                    elif not rfc and level == 3 + ety and header.startswith('Translation'):
                        rfctag = '{{rfc-level|' + header + ' at L3+' + levelact + '}}'
                    elif not rfc and level == 5 + ety and not inTrans and header.startswith('Translations to'):
                        rfctag = '{{rfc-level|' + header + ' at L5+, not in Translations' + levelact + '}}'
                    # blank line
                    newlines.append('')
                    # header + anything else that wasn't blank
                    newlines.append('======'[0:level] + header + '======'[0:level])
                    if rest.strip():
                        if not rest.startswith('{{rfc-'): acts.add('moved stuff after ' + header + ' header')
                        newlines.append(rest.strip())
                    # pointless, already stripped: elif len(rest) and not acts: acts.add('rm spaces after header')
                    if rfctag:
                        acts.add('+' + rfctag)
                        rfc = True
                        newlines.append(rfctag)
                        rfctag = ''
                    # set flags:
                    inext = defnext = False
                    if level < 4 + ety and (header in POS or header.startswith(HT)):
                        inext = inPos = True
                        npos += 1
                    elif level < 4 + ety: inPos = False
                    inTrans = (header == 'Translations')
                    tt = False
                    inPro = (header == 'Pronunciation')
                    continue
# look for inflection line
if inext:
if (line.startswith('{{') and line != '{{wikipedia}}') or line.startswith("''") or fuzzy(line,
title) > len(title) - 1:
inext = False
defnext = True
#print "infl: %s" % safe(line)
#else:
# print "not infl: %s" % safe(line)
if line and line[0] == '#':
acts.add('added inflection line for %s/%s' % (lang, header))
if lang in Ltocode:
newlines.append(infline(title, Ltocode[lang], header))
else:
newlines.append(infline(title, '', header))
defnext = True
inext = False
# and also do next case for defnext
# elide blanks above inflection line
if not line: continue
                # look for definition lines
                if defnext and line and line[0] == '#':
                    newlines.append('')
                    defnext = False
                # serious stuff ...
                if len(line) > 2 and line[0] == '#' and line[1] not in (' ', ':', '*', '#'):
                    line = '# ' + line[1:]
                if line[0:2] == '# ':
                    # look for context tag
                    if lang in Ltocode:
                        ctxn = 1
                        mo = recontext.match(line)
                        if not mo:
                            ctxn = 2
                            mo = recontext2.match(line)
                            if not mo:
                                ctxn = 3
                                mo = recontext3.match(line)
                        if mo:
                            print "match context tag %s" % safe(mo.group(1))
                            tname = cpar(mo.group(1), Contexts)
                        if mo and tname:
                            if lang != 'English': tname += '|lang=' + Ltocode[lang]
                            if contextp and ctxn == 1:
                                acts.add("-(''" + mo.group(1) + "'') +{{" + tname + "}}")
                                line = recontext.sub(r'# {{' + tname + r'}} \2', line)
                            elif contextp and ctxn == 2:
                                acts.add("-''(" + mo.group(1) + ")'' +{{" + tname + "}}")
                                line = recontext2.sub(r'# {{' + tname + r'}} \2', line)
                            elif contextp and ctxn == 3:
                                acts.add("-{{italbrac|" + mo.group(1) + "}} +{{" + tname + "}}")
                                line = recontext3.sub(r'# {{' + tname + r'}} \2', line)
                            else: print "would have replaced %s with %s" % (safe(mo.group(1)), safe(tname))
                            # elide cats that correspond
                            for catname in tname.split('|'):
                                if catname == 'context' or catname[0:5] == 'lang=': continue
                                catname = catname[0:1].upper() + catname[1:]
                                # code is prefix ...
                                if lang != 'English': catname = Ltocode[lang] + ':' + catname
                                if contextp:
                                    catseen.add('[[Category:' + catname + ']]')
                                    # catseen.add('[[Category:' + catname + 's]]')
                                    print "added catseen %s" % safe(catname)
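An aside: the `recontext*` patterns are compiled earlier in the file and are not shown in this excerpt. As a hedged sketch only, a pattern for the first form handled above, `# (''tag'') definition`, and the corresponding substitution could look like:

```python
import re

# hypothetical stand-in for recontext: group 1 is the tag, group 2 the rest
recontext_sketch = re.compile(r"# \(''([^')]+)''\)\s*(.*)")

def context_sub_sketch(line, tname):
    # rewrite "# (''nautical'') a rope" as "# {{nautical}} a rope"
    return recontext_sketch.sub('# {{' + tname + r'}} \2', line)
```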
                # wikilinking?
                """
                elif line[0:1] != '#' and not inTrans and line.find("''") >= 0:
                    # look for italbrac cases not on defn lines
                    newl = reibcomma.sub(ibsub, line)
                    newl = reibcomma2.sub(ibsub, newl)
                    if newl != line:
                        # acts.add('-' + line + ' +' + newl)
                        # acts.add('template i')
                        # in pronunciation, use a, anywhere else, we want i-c if at start of * line
                        if inPro:
                            newl = re.sub(r'\{\{(i|i-c)\|', '{{a|', newl)
                        else:
                            newl = re.sub(r'\{\{i\|', '{{i-c|', newl)
                        acts.add(sdif(line, newl))
                        line = newl
                    # think that will work?
                """
                # translations lines
                # stopgap check:
                if line.find('{{ttbc|') >= 0: inTrans = False
                if inTrans:
                    # special indent rule, we know there is a previous line
                    if line.startswith(':') and newlines[-1:][0].startswith('*'):
                        acts.add('-: +*: in trans')
                        line = '*' + line
                    was = False
                    mo = retrans1.match(line)
                    if not mo: mo = retrans2.match(line)
                    if mo: was = True
                    if not mo: mo = retrans3.match(line)
                    if mo:
                        tlang = mo.group(1).strip()
                        if was and tlang.find('|') > 0: tlang = tlang.split('|')[1]
                        trest = mo.group(2).strip()
                        if tlang.startswith('{{') and tlang[2:-2] in Lcodes:
                            acts.add('subst %s in trans' % tlang)
                            tlang = Lcodes[tlang[2:-2]]
                            was = False
                        if was and (tlang in Top40 or title == tlang):
                            acts.add('trans unlink ' + tlang)
                        elif not was and tlang in Classics and title != tlang:
                            tlang = '[[' + tlang + ']]'
                            acts.add('trans link ' + tlang)
                        elif was:
                            # leave as is (was)
                            tlang = '[[' + tlang + ']]'
                        tr = regender.sub(r'{{\1}}', trest)
                        if tr != trest:
                            #acts.add('gender -' + trest + ' +' + tr)
                            acts.add('gender ' + sdif(trest, tr))
                            trest = tr
                        if trest: line = '*' + tlang + ': ' + trest
                        else: line = '*' + tlang + ':'
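`regender` is likewise compiled elsewhere in the file; judging from the `{{\1}}` replacement above, it captures a bare italicised gender marker. A hypothetical version, for illustration only:

```python
import re

# hypothetical stand-in for regender: italicised gender markers like ''f''
regender_sketch = re.compile(r"''([mfnc])''")

def gender_sub_sketch(trest):
    # wrap a bare gender marker in a template, as the sub above does
    return regender_sketch.sub(r'{{\1}}', trest)
```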
                    # convert templates
                    # has to be a non-blank previous line, we are in trans section
                    if line == '{{rfc-trans}}': inTrans = False
                    if line == '{{checktrans}}': inTrans = False
                    if line == '{{checktrans-top}}': inTrans = False
                    if line == '{{ttbc-top}}': inTrans = False
                    mo = retopgloss.match(line)
                    if mo:
                        if mo.group(1):
                            gloss = mo.group(1)[1:]
                        else:
                            prev = newlines[-1:][0]
                            while not prev:
                                newlines = newlines[:-1]
                                prev = newlines[-1:][0]
                            if prev.startswith(';'): gloss = prev[1:]
                            elif prev.startswith("'''") and prev.endswith("'''"): gloss = prev[3:-3]
                            else: gloss = ''
                            if gloss: newlines = newlines[:-1]
                        if gloss:
                            gloss = reglossfix.sub(r'\1', gloss).strip()
                            prev = line
                            line = '{{trans-top|' + gloss + '}}'
                            # <- else: line = '{{trans-top}}'
                            acts.add('-' + prev + ' +' + line)
                        tt = True
                    if tt and line == '{{mid}}':
                        line = '{{trans-mid}}'
                    if tt and line == '{{bottom}}':
                        newlines.append('{{trans-bottom}}')
                        # add blank line
                        line = ''
                        tt = False
                # end of trans
                # all else
                newlines.append(line)
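`retopgloss` is not shown in this excerpt; its use above (`mo.group(1)` holding an optional `|gloss` argument whose leading pipe is then stripped) suggests a shape like the following, offered only as an assumption:

```python
import re

# hypothetical stand-in for retopgloss: {{top}} with an optional |gloss
retopgloss_sketch = re.compile(r'\{\{top(\|[^}]*)?\}\}\s*$')
```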
            # at end with no defn line in previous section?
            if defnext and header not in HAN:
                newlines.append('# {{defn|%s}}' % lang)
                acts.add('no definition line for %s/%s added {defn}' % (lang, header))
            # done with sect
            Lsect[lang] = newlines
        # reassemble ...
        newtext = ''
        prior = False
        for line in Lsect['*prolog']:
            # no blank lines
            if line: newtext += line + '\n'
        del Lsect['*prolog']
        blank = True
        for lang in sorted(Lsect, key=lkey):
            if lang == '*iwiki': continue
            if prior:
                if not blank: newtext += '\n'
                newtext += '----\n\n'
                divs -= 1
            prior = True
            if lang not in waslinked: newtext += '==' + lang + '==\n'
            else: newtext += '==[[' + lang + ']]==\n'
            blank = False
            for line in Lsect[lang]:
                # no dup blank lines
                if line or not blank: newtext += line + '\n'
                if line: blank = False
                else: blank = True
            if len(Lcats[lang]):
                if not blank: newtext += '\n'
                for cat in sorted(Lcats[lang]): newtext += cat + '\n'
                blank = False
            del Lsect[lang]
        if not blank: newtext += '\n'
        for line in Lsect['*iwiki']:
            # no blank lines
            if line: newtext += line + '\n'
        if divs != 0: acts.add("fixed ----'s")
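The sort key `lkey` used above is defined elsewhere in the file; en.wiktionary convention puts Translingual first and English second, ahead of all other language sections in alphabetical order, so a plausible sketch (an assumption, not the original) is:

```python
def lkey_sketch(lang):
    # Translingual sorts first, English second, the rest alphabetically
    if lang == 'Translingual': return '0'
    if lang == 'English': return '1'
    return '2' + lang
```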
        # rfc-level, etc trickery
        for rfname in ('level', 'xphrase', 'header'):
            if 'rm rfc-' + rfname + ' tag' in acts:
                for ac in sorted(acts):
                    if ac.startswith('+{{rfc-' + rfname):
                        acts.remove('rm rfc-' + rfname + ' tag')
                        acts.remove(ac)
                        print 'elided -' + rfname + ' +' + rfname
                        break
        """
        # if a new page, and no other actions, do minor spacing
        # (if rfc-tag trickery, this won't apply anyway, page is not new)
        if title in newpages and not acts and newtext.rstrip(' \n') != text.rstrip(' \n'):
            acts.add('minor spacing')
        """
        # if we added a major rfc, just do that, dump the rest of the work!!
        if rfcact:
            acts = set()
            acts.add(rfcact)
            newtext = text
        act = ''
        for ac in sorted(acts): act += ', ' + ac
        act = act.strip(', ')
        # some change, write it (even just rm tag)
        if act:
            fixed += 1
            print "format %s: %s" % (safe(title), safe(act))
            # try to fix the entry
            try:
                wikipedia.setAction(act)
                page.put(newtext)
            except wikipedia.PageNotSaved:
                print "failed to save page"
                # other action?
                continue
            except socket.timeout:
                print "socket timeout, maybe not saving page"
                continue
            except socket.error:
                print "socket error, maybe not saving page"
                continue
    # end loop
    print "entries fixed %d" % fixed
# done
if __name__ == "__main__":
    try:
        main()
    finally:
        wikipedia.stopme()