
Meidogger:Winterbot/AutoFormatcode


(Taken from Robert Ullmann's well-known AutoFormat bot on the English Wiktionary)

notice


This comes with several very important caveats:

  • I am a professional software engineer; this is what I do. However, this code was written for my own use, is not warranted, and carries no implication of merchantability or fitness for use.
  • Like everything else on Wiktionary, this is under the GFDL. The GFDL is not compatible with the GPL; this document is not licensed under the GPL as software. (!)
  • At any given moment, this code may not represent what is being run; I have no intention of updating this page every time I make a change.

technical notes


I don't have my ego attached to code I write; I routinely dump code that has gotten too complex, and re-write it. On the other hand, even if something is sloppy, if it is tested and works, I leave it alone.

  • Some of the comments may be snarky.
  • The comments are often (usually) written to remind me of something, not to explicate the code.
  • Since I modify this regularly, there is code that is not reached or otherwise redundant.
  • The pre-parsing should go deeper; a fairly major restructuring would be helpful at some point soon.
  • There is a small number of bugs known (to me ;-) that I have not yet fixed; I handle them by monitoring the edits the bot makes. (Handling multi-line comments, for example.)
  • The wikipedia.py module AF uses is heavily modified from the distro; however, the interface is the same. In the presence of network problems/failures/outages, AF may abort where the modified version would have recovered. The exceptions thrown are the same, but under differing conditions.
  • On Linux, the clock timing works, but will display ugly large values.
  • The code to handle headers is largely hacked to implement the "Connel" flag ....
  • The code that handles Etymology headers is based on the current WT:ELE; there is no problem changing it when we figure out how Etymology and Pronunciation are supposed to play nicely together in the general case.
  • It must have a sysop account as well, to read patrolled flags in RC; "enhanced" RC mode must be turned off.

outline

prescreen

Reads the XML dump, uses simple regexes to find entries that may need attention, and builds a randomly ordered index.

rcpages

Generator called by the main routine. Calls prescreen, then cycles through reading Recent Changes, looking at the request category, and yielding pages found.
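The interleaving described above can be sketched as a generator that alternates between two sources: the prescreened index (in random-key order) and a "hold" table of recently changed titles that are only released once their hold time has passed. This is a simplified, hypothetical reduction; the real routine also polls the request category and Special:Recentchanges, and builds `hold` from the RC feed.

```python
import time

def page_stream(psk, hold, batch=10):
    """Yield prescreened titles, releasing due held titles in between.

    psk:  dict of random-key -> (title, reason), as built by prescreen
    hold: dict of title -> earliest release time (epoch seconds)
    """
    for key in sorted(psk):
        title, reason = psk[key]
        yield title
        # between prescreen hits, release up to `batch` held titles
        released = 0
        for t in sorted(hold):
            if released >= batch or hold[t] > time.time():
                continue
            del hold[t]
            released += 1
            yield t
```

The real generator never terminates (it keeps polling RC), whereas this sketch ends when the prescreen index is exhausted.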

main

Reads the configuration pages and builds the tables to be used. Loops on the rcpages generator; for each entry:

  • runs regex on the entire text
  • breaks the entry into language sections, plus the prolog (above the first section) and the interwiki links
  • in each language section:
    • looks for and fixes Etymology headers
    • herds cats
    • fixes bad headers
    • fixes linking in trans tables
    • fixes top to trans-top
    • subst's (replaces) language code template
    • etc
  • then reassembles the entry, removing multiple blank lines, adding ---- rules, and so on
  • checks the actions performed
  • if any resulting action, rewrites the page
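The section split in the first two bullets can be sketched like this (simplified regexes; the real code also unlinks and validates the language header, checks sort order, and only files an interwiki if it points at the same page title):

```python
import re

# an L2 header like ==English== (L3+ headers have a '=' right after '==')
reL2 = re.compile(r'==\s*([^=]+?)\s*==\s*$')
# an interwiki link like [[fr:test]]
reiwiki = re.compile(r'\[\[[-a-z]{2,11}:.*\]\]')

def split_entry(text):
    """Break wikitext into *prolog, per-language sections, and *iwiki."""
    sects = {'*prolog': [], '*iwiki': []}
    curr = '*prolog'
    for line in text.splitlines():
        mo = reL2.match(line)
        if mo:
            curr = mo.group(1)
            sects.setdefault(curr, [])
            continue
        if reiwiki.match(line):
            sects['*iwiki'].append(line)
            continue
        sects[curr].append(line)
    return sects
```

Reassembly is then the reverse: emit the prolog, each language section in sort order separated by `----` rules, and the interwikis last.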

code

#!/usr/bin/python
# -*- coding: utf-8  -*-

############################################################
#                                                          #
# NOTE: utf-8 doesn't work for me on fy wikt; Latin1 does. #
#                                                          #
############################################################


"""
This bot looks for entries tagged for autoformatting, does a number of tasks

No command line arguments.

"""

import wikipedia
import catlib
import sys
import re
import pickle
import time
import xmlreader
import socket

def safe(s):
    return pickle.dumps(s)[1:-5]

def lkey(l):
    n = l.strip('[]')
    if n == 'Translingual': return '0' + n
    if n == 'English': return '1' + n

    # bad L2 headers
    if n.lower() == 'cyrillic alphabet': return '0' + n
    if n.lower() == 'arabic alphabet': return '0' + n
    if n.lower() == 'see also': return '3' + n
    if n.lower() == 'references': return '4' + n

    return '2' + n

PSK = { }
from random import random
AH = set()
#newpages = set()

def prescreen():

    # get XML dump
    dump = xmlreader.XmlDump("../hancheck/en-wikt.xml")

    srx = { }

    srx['lcode header'] = re.compile(r'^== *\{\{.{2,3}\}\} *==', re.M)
    srx['lcode trans'] = re.compile(r'^\* *\{\{.{2,3}\}\} *:', re.M)
    srx['top template w/param'] = re.compile(r'\{\{top\|')
    srx['top template w/semi gloss'] = re.compile(r'^;.*\n\{\{top\}', re.M)
    srx['top template w/qbold gloss'] = re.compile(r"'''\n\{\{top\}")
    srx['gender'] = re.compile(r"^\*.*:.*''[fmcn]''", re.M)
    # srx['wikipediapar'] = re.compile(r"\{\{[Ww]ikipediapar")
    srx['Wikipedia'] = re.compile(r"\{\{Wikipedia")
    srx['Unicode'] = re.compile(r"\{\{Unicode")
    srx['Acronym'] = re.compile(r"\{\{Acronym")
    srx['Initialism'] = re.compile(r"\{\{Initialism")
    srx['Abbreviation'] = re.compile(r"\{\{Abbreviation")
    srx['cattag'] = re.compile(r"\{\{cattag")
    srx['trad'] = re.compile(r"\{\{trad-?\|")
    srx['rest after header'] = re.compile('^=+[^=\n]+=+[^=\n]+$', re.M)
    srx['Pronounciation'] = re.compile('Pronounciation')
    srx['categorized'] = re.compile('[Cc]ategori[sz]ed')
    # srx['transitive verb'] = re.compile('=== ?(Int|T)ransitive verb', re.I)
    srx['etymology with parens'] = re.compile('Etymology ?\(')
    srx['etymology at L4'] = re.compile('^==== ?Etymology', re.M)
    srx['also see'] = re.compile('= ?Also see')
    srx['indented see'] = re.compile(r'^:\{\{see\|', re.M)
    srx['indented Cyrillic'] = re.compile(r'^:Cyrillic', re.M)
    srx['indented Roman'] = re.compile(r'^:Roman', re.M)

    # next really needs something re lang sects, try it for now, sorta works
    # srx['maybe un-numbered ety'] = re.compile(r'^=== ?Etymology ?===.*Etymology', re.M|re.S)

    srx['PAGENAME'] = re.compile('\{PAGENAME')
    srx['-----'] = re.compile('-----')

    # header case problems ...
    srx['lc header'] = re.compile(r'^={1,6} ?[a-z][-a-zA-Z ]*=+$', re.M)
    srx['non sc header'] = re.compile(r'^={3,6} ?[A-Z][-a-z ]*[A-Z][-a-zA-Z ]*=+$', re.M)

    # contexts
    srx['context tag'] = re.compile(r"^# *\(''.+?''\)", re.M)
    srx['context tag 2'] = re.compile(r"^# *''\(.+?\)''", re.M)
    srx['context italbrac'] = re.compile(r"^# *\{\{italbrac", re.M)


    # re-work rfc level tags (not for now! don't go looking for them)
    # srx['rfc level'] = re.compile(r'^\{\{rfc-level.*\+',re.M)
    # srx['X phrase'] = re.compile(r'^={3,5} *[-a-zA-Z ]* phrase *=+$', re.M)

    # and so on
    srx['chu nom crap'] = re.compile(r'^===\s*chu nom\s*===', re.M|re.I)

    reah = re.compile(r'^={3,6} *([-a-zA-Z ]+) *=+$', re.M)

    counts = { 'at random':0, 'unknown header':0 }
    for r in srx: counts[r] = 0

    entries = 0
    tags = 0

    piscine = set()
    # skip a few others besides the level 3-6 headers
    AH.add('Mandarin')
    AH.add('Cantonese')
    AH.add('Min Nan')

    for entry in dump.parse():
        text = entry.text
        title = entry.title
        if title.find(':') >= 0: continue
        if text and text[0:1] == '#': continue

        entries += 1
        if entries % 10000 == 0: print "prescreen: %d entries, %d tagged" % (entries, tags)

        # screen entries:
        tag = False

        for reason in srx:
            if srx[reason].search(text):
                # most of context, but not all ...
                if reason.startswith('context') and random() > 0.7: continue
                tag = True
                break

        if not tag and random() < 0.0035:
            reason = 'at random'
            tag = True

        if not tag and text.find('{{rfc') < 0:
            for mo in reah.finditer(text):
                h = mo.group(1).strip()
                if h not in AH:
                    if h not in piscine:
                        print "header %s tagged" % safe(h)
                        piscine.add(h)
                    reason = "unknown header"
                    tag = True

        if tag:
            counts[reason] += 1
            # print 'prescreen: %s selected %s' % (safe(title), reason)
            PSK[random()] = (title, reason)
            tags += 1

    for r in sorted(counts):
        print 'count for %s is %d' % (r, counts[r])
    
def now(): return int(time.clock())
    
def rcpages(site):
    # generator which yields recentchanges, but not unpatrolled changes
    # also entries in category
    # in between, yields pages that satisfy the prescreen in random order

    prescreen()

    site = wikipedia.getSite("en", "wiktionary")
    cat = catlib.Category(site, "Category:Requests for autoformat")

    seen = set()

    nextcat = now() - 1
    nextrc = now() - 1

    hold = { }
    rcex = re.compile(r'<li>.*?<a href=".+?" title="(?P<title>.+?)">(?P=title)</a>')

    for r in sorted(PSK):
        # sorted!
        title, reason = PSK[r]
        seen.add(title)
        print '(%d, from prescreen %s)' % (now(), reason)
        page = wikipedia.Page(site, title)
        yield page

        nf = 0
        nd = 0

        # get our category, every 10-15 minutes or so
        if now() > nextcat:
            nextcat = now() + 700
            cat.catlist(purge = True)
            for page in cat.articles():
                nf += 1
                # time.sleep(3)
                print '(%d)' % now()
                seen.add(page.title)
                if page.title in hold: del hold[page.title]
                yield page

        # recent changes
        if now() > nextrc:
            print '(%d, reading recent changes)' % now()
            try:
                rct = site.getUrl("/w/index.php?title=Special:Recentchanges&limit=500", sysop = True)
            except wikipedia.NoPage:
                print "Can't get recentchanges from en.wikt!"
                rct = ''
                time.sleep(30)
                continue
            nextrc = now() + 610

            ht = 480
            for line in rct.splitlines():
                # if line.startswith('<li>'): print safe(line)
                if line.find('Deletion log') > 0: continue
                mo = rcex.match(line)
                if mo:
                    title = mo.group('title')
                    if title.find(':') >= 0: continue
                    if title.find('/') >= 0: continue
                    #if line.find('<span class="newpage">N</span>') > 0: newpages.add(title)
                    if line.find('"unpatrolled"') > 0: continue
                    if line.find('User talk:AutoFormat') > 0: seen.add(title)
                    if title not in seen:
                        seen.add(title)
                        hold[title] = now() + ht
                        # scatter out into future ...
                        ht += 34
                        nf += 1
                        print "found: [%s] hold until %d" % (safe(title), hold[title])

        pastime = now()
        for title in sorted(hold):
             # 10 on a pass is enough
             if nd > 9: break
             if hold[title] > pastime: continue
             del hold[title]
             nd += 1
             # time.sleep(3)
             print '(%d)' % now()
             page = wikipedia.Page(site, title)
             yield page

        if nd == 0:
            print "(%d, sleeping)" % now()
            # usually 10 to 20 seconds, but rely on put throttle
            time.sleep(70)

        print '(%d, %d entries held)' % (now(), len(hold))

        continue

# now have some serious recursion fun!
# fuzzy returns string match score

def fuzzy(a, b):

    if not a or not b: return 0

    if a[0:1] == b[0:1]: return 1 + fuzzy(a[1:], b[1:])

    # try skip this char
    sk = fuzzy(a[1:], b[1:])

    # try with each char forward
    p = a.find(b[0:1])
    if p >= 0: sca = 1 + fuzzy(a[p+1:], b[1:])
    else: sca = 0

    p = b.find(a[0:1])
    if p >= 0: scb = 1 + fuzzy(b[p+1:], a[1:])
    else: scb = 0

    return max(sk, sca, scb)

def infline(title, lang, header):

    pos = header.lower()
    if pos.startswith('{{'):
        pos = pos[2:-2].split('|')[0]

    if lang == 'en':
        if pos in ['verb', 'noun', 'adjective', 'adverb']:
            return "{{infl|en|" + pos + "}}[[Category:English "+ pos +"s that lack inflection template]]"

    a = ord(title[0:1])

    # Arabic:
    if (a >= 0x0600 and a < 0x0780):
        return "{{infl|%s|%s|sc=Arab}}" % (lang, pos)

    # Han:
    # this is planes 1-2, needs closer check
    if (a >= 0x3400 and a < 0xA000) or (a >= 0xd800 and a < 0xdc00):
        if lang == 'ko':
            return "{{infl|%s|%s|sc=Hant}}{{ko-attention|may need inflection template}}" % (lang, pos)
        elif lang == 'ja':
            return "{{infl|%s|%s|sc=Jpan}}{{ja-attention|needs inflection template}}" % (lang, pos)
        elif lang == 'vi':
            return "{{infl|%s|%s|sc=Hant}}{{vi-attention|may need inflection template}}" % (lang, pos)
        else:
            return "{{infl|%s|%s|sc=Hani}}{{zh-attention|needs inflection template}}" % (lang, pos)

    if lang == 'ja':
            return "{{infl|%s|%s}}{{ja-attention|needs inflection template}}" % (lang, pos)
    if lang == 'ko':
            return "{{infl|%s|%s}}{{ko-attention|may need inflection template}}" % (lang, pos)
    if lang in ['zh', 'cmn', 'yue', 'nan']:
            return "{{infl|%s|%s}}{{zh-attention|may need inflection template}}" % (lang, pos)

    return "{{infl|%s|%s}}" % (lang, pos)

MOD = [ 'chiefly', 'coarse', 'especially', 'extremely', 'frequently', 'generally', 'mainly', 'markedly',
    'mildly', 'mostly', 'often', 'particularly', 'primarily', 'sometimes', 'usually', 'very' ]

reunlink = re.compile(r'\[\[(.*?)\]\]')

PRETULIP = ('of ', 'by ')

def cpar(cstr, ctxs):

    # convert context string to template name(s)

    tname = ''
    cstr = re.sub(r'[,;\|]+', ',', cstr) 
    for cs in cstr.split(','):
        cs = cs.strip(" '")
        if cs.find('[') >= 0: cs = reunlink.sub(r'\1', cs)
        # handles n modifiers, does context? yes.
        while cs.split(' ')[0].lower() in MOD:
            mod = cs.split(' ')[0].lower()
            tname += mod + '|'
            cs = cs[len(mod):].strip()
        if cs.lower() in ctxs:
            tname += ctxs[cs.lower()] + '|'
        elif cs.startswith(PRETULIP):
            tname += cs + '|'
        else: return ''

    tname = tname.rstrip('|')
    if tname.startswith(PRETULIP): tname = 'context|' + tname

    return tname

def ibsub(imo):

    # some prefix captured
    pref = imo.group(1)
    istr = imo.group(2)
    s = reunlink.sub(r'\1', istr)
    # not general enough, bar pipes in match for now in re precomp
    #if s != istr and s.find('|') >= 0: s = s.split('|')[1]
    s = re.sub(r',\s*', '|', s)
    if imo.group(3) == ':':
        return pref + '{{i-c|' + s + '}}'
    else:
        return pref + '{{i|' + s + '}}'

def sdif(a, b):

    # returns -(a stuff) +(b stuff) when one change

    i = 0
    while a[i:i+1] and a[i:i+1] == b[i:i+1]: i += 1
    an = a[i:]
    bn = b[i:]
    i = 1
    while i < len(an) and an[-i:] == bn[-i:]: i += 1
    i -= 1
    if i: return '-' + an[:-i] + ' +' + bn[:-i]
    else: return '-' + an + ' +' + bn

    # okay, try that! not so pretty is it?

def main():

    socket.setdefaulttimeout(30)

    # regex precomp
    reL2head = re.compile(r'==?\s*([^=]+)={1,6}(.*)')
    reheader = re.compile(r'(={3,6})\s*([^=]+)={1,6}(.*)')
    reiwiki = re.compile(r'\[\[[-a-z]{2,11}:(.*)\]\]')
    recat = re.compile(r'\[\[category:.*?\]\]', re.I)
    retrans1 = re.compile(r'\*\s*\[\[w:.+\|([^\]]+?)\]\]\s*:(.*)')
    retrans2 = re.compile(r'\*\s*\[\[([^\]]+?)\]\]\s*:(.*)')
    retrans3 = re.compile(r'\*\s*([^:]+?):(.*)')
    retag = re.compile(r'\{\{rfc-auto(\|.*?|)}}')
    regender = re.compile(r"''([mfcn])''")
    reglossfix = re.compile(r'(.+)\(\d+\)$')
    retopgloss = re.compile(r'\{\{top(\|.*?|)}}$')
    recontext = re.compile(r"^# *\(''(.+?)''\):? ?(.*)$", re.M)
    recontext2 = re.compile(r"^# *''\((.+?)\):?'' ?(.*)$", re.M)
    recontext3 = re.compile(r"^# *\{\{italbrac\|([^}]+?)}}:? ?(.*)$", re.M)

    # be careful to match and remove newline in these:
    rerfclevel = re.compile(r"^\{\{rfc-level\|.*\+.*\}\}\n", re.M)
    rerfcxphrase = re.compile(r"^\{\{rfc-xphrase\|.*\}\}\n", re.M)
    rerfcheader = re.compile(r"^\{\{rfc-header\|.*\}\}\n", re.M)

    # italbracs not on context/defn lines, template italbrac->i replacement separate
    # limited forms ... nowilink with pipes, no templates, look for : in mo.g3
    # look for gloss, etc, * lines to start ...
    reibcomma = re.compile(r"^(\*\s*)\(''([^\)^'^\|^\{]+):?''\)(:?)")
    reibcomma2 = re.compile(r"^(\*\s*)''\(([^\)^'^\|^\{]+):?\)''(:?)")

    # regex table (dict, name = tuple of compiled object and replacement)
    Regex = { }
    Regex['subst:PAGENAME'] = (re.compile(r'\{\{PAGENAME}}'), '{{subst:PAGENAME}}')
    Regex['template -cattag +context'] = (re.compile(r'\{\{cattag\|'), '{{context|',)
    Regex['template -Unicode +unicode'] = (re.compile(r'\{\{Unicode\|'), '{{unicode|',)
    #Regex['template -wikipediapar +wikipedia'] = (re.compile(r'\{\{[Ww]ikipediapar([\|\}])'), r'{{wikipedia\1')
    Regex['template -Wikipedia +wikipedia'] = (re.compile(r'\{\{Wikipedia([\|\}])'), r'{{wikipedia\1')
    Regex['template -Acronym +acronym'] = (re.compile(r'\{\{Acronym([\|\}])'), r'{{acronym\1')
    Regex['template -Initialism +initialism'] = (re.compile(r'\{\{Initialism([\|\}])'), r'{{initialism\1')
    Regex['template -Abbreviation +abbreviation'] = (re.compile(r'\{\{Abbreviation([\|\}])'), r'{{abbreviation\1')

    Regex['template -trad +t'] = (re.compile(r'\{\{trad\|'), '{{t|',)
    Regex['template -trad- +t-'] = (re.compile(r'\{\{trad-\|'), '{{t-|',)

    Regex['un-indent {{see}} template'] = (re.compile(r'^:\{\{see\|', re.M), '{{see|')

    # make sure we are logged in
    site = wikipedia.getSite("en", "wiktionary")
    site.forceLogin(sysop = True)
    site.forceLogin(sysop = False)

    # get our config pages, throw exceptions: we have to stop if we can't read these
    # use sysop to bypass "page is locked?" exception

    print "read languages"
    page = wikipedia.Page(site, "User:AutoFormat/Languages")
    langtab = page.get(sysop = True)
    print "read headers"
    page = wikipedia.Page(site, "User:AutoFormat/Headers")
    headtab = page.get(sysop = True)
    print "read Top40"
    page = wikipedia.Page(site, "Wiktionary:Translations/Wikification")
    top40tab = page.get(sysop = True)
    print "read contexts"
    page = wikipedia.Page(site, "User:AutoFormat/Contexts")
    ctxtab = page.get(sysop = True)

    Lcodes = { }
    Ltocode = { }
    relangtab = re.compile(r'\| (.*?)\|\|(.*)')
    i = 0
    for line in langtab.splitlines():
        mo = relangtab.match(line)
        if mo:
            for code in mo.group(1).split(','):
                Lcodes[code.strip()] = mo.group(2).strip()
                i += 1
            Ltocode[mo.group(2).strip()] = mo.group(1).split(',')[0].strip()

    print "found %d language codes" % i

    # treat a couple of other codes as Mandarin etc, since they are in cats:
    Lcodes['zh-cn'] = 'Mandarin'
    Lcodes['zh-tw'] = 'Mandarin'
    Lcodes['nan-cn'] = 'Min Nan'
    Lcodes['nan-tw'] = 'Min Nan'
    Lcodes['yue-cn'] = 'Cantonese'
    Lcodes['yue-hk'] = 'Cantonese'


    Level = { }
    L43 = { }
    POS = { }
    EOS = [ 'See also', 'References', 'External links', 'Anagrams']
    TOS = [ 'Pronunciation', 'Alternative spellings', 'Alternative forms']
    HAN = ['Han character', 'Kanji', 'Hanzi', 'Hanza']
    HT = ( '{{abbreviation', '{{initialism', '{{acronym', '{{numeral' )
    NS = { }
    Hfix = { }
    reheadtab = re.compile(r'\| (.*?)\|\|\s*([1-5/]*)\s*\|\|(.*?)\|\|(.*?)\|\|(.*)')
    i = 0
    for line in headtab.splitlines():
        mo = reheadtab.match(line)
        if mo:
            header = mo.group(1).strip()
            if mo.group(2).strip() == '4/3':
                L43[header] = True
                Level[header] = 4
                print "header %s is 4/3" % header
            else: Level[header] = int(mo.group(2))
            if mo.group(3).strip() == 'NS': ns = NS[header] = True
            else: ns = False
            if mo.group(4).strip() == 'POS': POS[header] = True

            for variant in mo.group(5).split(','):
                variant = variant.lower().strip()
                if not variant: continue
                Hfix[variant] = header
                if not ns:
                    if variant[-1:] == 's': Hfix[variant[:-1]] = header
                    else: Hfix[variant + 's'] = header

            Hfix[header.lower()] = header
            if not ns:
               if header[-1:] == 's': Hfix[header.lower()[:-1]] = header
               else: Hfix[header.lower() + 's'] = header
            i += 1

    print "found %d headers" % i

    # lots of possible ety sects, 1 to 24
    for i in range(1, 25):
        Hfix['etymology %d'%i] = 'Etymology %d'%i
        Level['Etymology %d'%i] = 3

    Top40 = { }
    Classics = { }
    retop40tab = re.compile(r'\*\s*(.*)')
    i = j = 0
    inT40 = True
    for line in top40tab.splitlines():
        if line.startswith('----'): inT40 = False
        mo = retop40tab.match(line)
        if mo:
            lang = mo.group(1).strip(' []')
        else: continue
        if inT40:
            Top40[lang] = True
            i += 1
        else:
            if lang in Top40:
                print "language %s in both Top40 and Classics?" % safe(lang)
                continue
            Classics[lang] = True
            j += 1
                

    print "found %d Top 40 languages" % i
    print "found %d Classic languages" % j

    # add all other known languages not in Top40:
    i = 0
    for code in Lcodes:
        lang = Lcodes[code]
        if lang not in Top40 and lang not in Classics:
            if lang == 'English': continue
            Classics[lang] = True
            i += 1
            print "added Classic: %s" % safe(lang)

    print "added %d languages to Classics" % i

    Contexts = { }
    rectxtab = re.compile(r"\|\s*''(.*?)''\s*\|\|(.*)")
    i = 0
    for line in ctxtab.splitlines():
        mo = rectxtab.match(line)
        if mo:
            m1 = mo.group(1).strip()
            m2 = mo.group(2).strip()
            if not m1 or not m2: continue
            # only use first, table at top over-rides auto, templates over-ride redirects
            if m1 not in Contexts: Contexts[m1] = m2
            i += 1

    print "found %d context templates" % i

    # turn on/off for now
    contextp = True

    entries = 0
    fixed = 0

    # (specific stats)

    # Set up set of all headers that are valid (at L3 or higher)

    for header in Level:
        AH.add(header)

    # Sigh. True means prohibited from changing 4/3 levels
    Connel = True

    for page in rcpages(site):

        title = page.title()

        print "page %s" % safe(title)

        if title.find(':') >= 0:
            print "not in main namespace"
            continue

        if title.find('/') >= 0:
            print "apparent subpage"
            continue

        entries += 1

        try:
            text = page.get()
        except wikipedia.NoPage:
            print "Can't get %s from en.wikt" % safe(title)
            text = ''
            continue
        except wikipedia.IsRedirectPage:
            print "Redirect page %s" % safe(title)
            text = ''
            continue

        acts = set()

        mo = retag.search(text)
        if mo:
            if mo.group(1).strip(' |'):
                acts.add('rm tag:' + mo.group(1).strip(' |'))
            else:
                acts.add('rm tag')
            text = retag.sub('', text)

        # rfc level trickery
        newtext = rerfclevel.sub('', text)
        if newtext != text:
            print 'took out rfc-level'
            acts.add('rm rfc-level tag')
            text = newtext

        # same for xphrase
        newtext = rerfcxphrase.sub('', text)
        if newtext != text:
            print 'took out rfc-xphrase'
            acts.add('rm rfc-xphrase tag')
            text = newtext

        # same for header
        newtext = rerfcheader.sub('', text)
        if newtext != text:
            print 'took out rfc-header'
            acts.add('rm rfc-header tag')
            text = newtext

        if text.find('{{rfc') >= 0: rfc = True
        #elif text.find('{{rfc|') >= 0: rfc = True
        #elif text.find('{{rfc-') >= 0: rfc = True
        else: rfc = False
        rfcact = ''

        # overall regex, using table 

        for rx in Regex:
            newtext = Regex[rx][0].sub(Regex[rx][1], text)
            if newtext != text:
                acts.add(rx)
                text = newtext

        # categories found in the entry or implied by context and perhaps inflection templates
        catseen = set()

        # now parse. take the entry apart into languages (ha!)

        curr = '*prolog'
        last = ''
        Lsect = { '*prolog':[ ], '*iwiki':[ ] }
        Lcats = { }
        waslinked = [ ]
        divs = 0

        for line in text.splitlines():

            # L2 headers
            mo = reL2head.match(line)
            if mo:
                header = mo.group(1).strip()
                hf = reunlink.sub(r'\1', header)
                if hf != header:
                    if hf.find('|') >= 0: hf = hf.split('|')[1]
                    if hf not in Top40: waslinked.append(hf)
                    else: acts.add('unlink language header ' + hf)
                    header = hf

                # validate language
                if header in Level:
                    if not rfc:
                        text = '{{rfc|' + header + ' as level 2 header}}\n' + text
                        rfcact = 'add rfc tag for L1/2 header ' + header
                        rfc = True
                    else:
                        print "(no edit, bad L2 header and rfc)"
                        rfcact = 'bad L1/2 header ' + header

                # subst code template
                if header[0:2] == '{{':
                    if header[2:-2] in Lcodes:
                        hf = Lcodes[header[2:-2]]
                        acts.add('L2 header -' + header + ' +' + hf)
                        header = hf

                # check sort order
                if header and last and lkey(header) < lkey(last):
                    acts.add(last + '/' + header + ' sorted into order')
                last = header

                if header not in Lsect:
                    Lsect[header] = [ ]
                    Lcats[header] = [ ]

                else:
                    acts.add('merged ' + header + ' sections')
                curr = header
                if mo.group(2).strip():
                    acts.add('stuff after L2 header moved')
                    Lsect[curr].append(mo.group(2).strip())
                continue

            # look for iwiki
            mo = reiwiki.match(line)
            if mo and mo.group(1) == title:
                Lsect['*iwiki'].append(line)
                continue

            # trailing spaces
            if len(line) > 2 and line[0] == '=' and line[-1:] == ' ': acts.add('rm spaces after header')
            line = line.strip()

            # take out dividers
            if line.startswith('----'):
                if line == '----': divs += 1
                continue

            # other lines
            Lsect[curr].append(line)

        # any language sections?
        if len(Lsect) == 2:
            # no, tag if not tagged
            if text.find('nolanguage/box') < 0 and text.find('{{wikify') < 0:
                Lsect['*prolog'][0:0] = ['{{subst:nolanguage}}']
                acts.add('tagged nolanguage')

        # each section

        for lang in Lsect:
            if lang[0] == '*': continue

            # find Etymologies first

            etys = [ ]
            etycount = 0
            fh = True
            for i in range(len(Lsect[lang])):
                line = Lsect[lang][i]

                # look for ety headers, and Pronunciation first at L4
                mo = reheader.match(line)
                if mo:
                    level = len(mo.group(1))
                    header = mo.group(2).strip()
                    # rest = mo.group(3)

                    # special case pronunciation, occurs with some frequency

                    if fh and level != 3 and fuzzy(header.lower(), 'pronunciation') >= 11 and len(header) < 15:
                        acts.add('Pronunciation changed to level 3')
                        Lsect[lang][i] = '===' + header + '==='
                        # and leave fh set:
                        continue

                    # just do fuzzy!
                    if fuzzy(header.lower(), 'etymology') >= 7 and len(header) < 20:
                        if level != 3:
                            if fh:
                                # first header, okay to fix!
                                acts.add('Etymology changed to level 3')
                                # and leave fh set:
                                etycount += 1
                                etys.append(i)
                                continue
                            elif not rfc:
                                Lsect[lang][i] = line + '{{rfc-level|Etymology not at level 3}}'
                                acts.add('+{{rfc-level|Etymology not at level 3}}')
                                rfc = True
                                continue
                            else:
                                print "(ety not at L3 and already rfc)"
                                continue
                        etycount += 1
                        etys.append(i)
                    fh = False

            # then fix/rewrite the ety headers, use sub to handle rest, report any changes (spacing an issue):
            if etycount:
                for i in range(etycount):
                    line = Lsect[lang][etys[i]]
                    # print 'ety check replace ' + line
                    if etycount > 1: newline = reheader.sub(r'===Etymology %d===\3' % (i+1), line)
                    else: newline = reheader.sub(r'===Etymology===\3', line)
                    if newline.strip('= ') != line.strip('= '):
                        acts.add('header -' + line.strip('= ') + ' +' + newline.strip('= '))
                    Lsect[lang][etys[i]] = newline

            # sigh, think that's it? Sweet, if true...

            # general format
            newlines = [ ]

            inPos = inTrans = inPro = inext = defnext = False
            npos = 0
            ety = nety = 0
            levelact = ''
            rfctag = ''

            for line in Lsect[lang]:

                # move cats, may be something else on the line too, or multicats ...
                # first we need a cat-present predicate
                catp = False
                for cat in recat.findall(line):
                    catp = True
                    catname = cat[11:-2].split('|')[0]
                    catname = re.sub('_', ' ', catname).strip()
                    cf = cat.find('|')
                    if cf > 0: cat = '[[Category:' + catname + cat[cf:]
                    else: cat = '[[Category:' + catname + ']]'
                    # we have a canonical cat! is it a novel cat?
                    if cat in catseen:
                         acts.add('rm dup cat [[:' + cat[2:])
                         continue
                    catseen.add(cat)
                    # see if it belongs in a different sect
                    catmove = False
                    
                    if catname.find(':') >= 0:
                        catcode = catname.split(':')[0]
                        if catcode in Lcodes:
                            catlang = Lcodes[catcode]
                            if catlang != lang and catlang in Lcats:
                                acts.add('category ' + catname + ' moved to ' + catlang + ' section')
                                Lcats[catlang].append(cat)
                                catmove = True
                    elif not catname.lstrip(' 01').startswith(lang) and not catname.endswith('derivations'):
                        for other in Lcats:
                            if other == lang: continue
                            if catname.lstrip(' 01').startswith(other+' '):
                                acts.add('category ' + catname + ' moved to ' + other + ' section')
                                Lcats[other].append(cat)
                                catmove = True
                                break

                    # not moved
                    if not catmove: Lcats[lang].append(cat)
                if catp:
                    line = recat.sub('', line).strip()
                    if not line: continue

                # headers
                mo = reheader.match(line)
                if mo:
                    # hit header with no defn line in previous section?
                    if defnext and header not in HAN:
                        newlines.append('# {{defn|%s}}' % lang)
                        acts.add('no definition line for %s/%s added {defn}' % (lang, header))

                    level = len(mo.group(1))
                    header = mo.group(2).strip()
                    rest = mo.group(3)

                    # unlink header
                    hf = reunlink.sub(r'\1', header)
                    if hf != header:
                        if hf.find('|') > 0: hf = hf.split('|')[1]
                        acts.add('header -' + header + ' +' + hf)
                        header = hf

                    # fix header
                    if header.lower() in Hfix:
                        hf = Hfix[header.lower()]
                        if hf != header:
                            acts.add('header -' + header + ' +' + hf)
                            header = hf

                    # try a fuzzy!
                    if header.lower() not in Hfix and header[0:2] != '{{':
                        high = 0
                        replac = ''
                        hf = header.strip('[]{}').lower()
                        for val in sorted(Hfix):
                            # first character must match
                            if hf[0:1] != val[0:1]: continue
                            rawsc = fuzzy(hf, val)
                            if rawsc > high and rawsc > max(max(len(hf), len(val)) - 3, 5):
                                high = rawsc
                                replac = val
                        print 'fuzzy for %s: %s score %d' % (safe(hf), replac, high)
                        if high:
                            hf = Hfix[replac]
                            acts.add('header -' + header + ' +' + hf)
                            header = hf

                    # tag Transitive and Intransitive verb
                    if header.lower() in ('transitive verb', 'intransitive verb') and not rfc:
                        rfctag = '{{rfc-trverb|' + header + '}}'
                        rfc = True
                        # print "trans/intrans header: %s" % safe(header)

                    # tag X phrase
                    if header.endswith(' phrase') and not rfc:
                        rfctag = '{{rfc-xphrase|' + header + '}}'
                        rfc = True
                        # print "X phrase header: %s" % safe(header)

                    # rfc unrecognized, ignore templates for now, use NS later
                    if header.lower() not in Hfix and not rfc and header[0:2] != '{{':
                        rfctag = '{{rfc-header|' + header + '}}'
                        rfc = True
                        # print "unknown header: %s" % safe(header)

                    # min level, set and comp for nested ety
                    if level == 3 and header.startswith("Etymology") and etycount > 1:
                        ety = 1
                        nety += 1
                        npos = 0
                        push = False
                    else:
                        if ety:
                            # if we are in the last ety sect, and see end of section things at L3:
                            if level < 4 and nety == etycount and header in EOS: inPos = ety = 0
                            # and ... independent of connel flag, because we always push ;-)
                            if level < 4 and nety == etycount and header in L43: inPos = ety = 0

                        # push POS (or level 3?) sections down in ety, push flag because of Connel fix
                        # may be a good idea anyway ... yes, but if we rfc, stop
                        if ety and not rfc:
                            if ((header in POS and header not in HAN) or header in TOS) and level == 3:
                                level = 4
                                acts.add('header in ety sect ' + header + ' to L' + str(level))
                                if header == 'Pronunciation':
                                    rfctag = '{{rfc-level|check placement of Pronunciation}}'
                                push = True
                            elif ((header in POS and header not in HAN) or header in TOS):
                                # at correct level! (or too deep already)
                                push = False
                            elif push and header in Level and (level == 4 or level < Level[header] + ety):
                                level += 1
                                acts.add('header in ety sect ' + header + ' to L' + str(level))
                            elif level < 4: push = False

                        # code to shift header levels (general case in POS), disabled per Connel, 18.4.7
                        if inPos and header in L43:
                            if npos < 2 and level < 4 + ety:
                                if not Connel:
                                    level = 4 + ety
                                    acts.add('header ' + header + ' to L' + str(level))
                                else: levelact = ' (AutoFormat would have corrected level of ' + header +')'
                        elif inPos and header in Level:
                            if level < Level[header] + ety:
                                if not Connel:
                                    level = Level[header] + ety
                                    acts.add('header ' + header + ' to L' + str(level))
                                else: levelact = ' (AutoFormat would have corrected level of ' + header +')'
                        
                        # now tag remaining problems if any, various cases
                        # should all contain "+" for the re-visit trick ...
                        if not rfc and level == 4 + ety and not inPos and header in Level and header not in NS:
                            rfctag = '{{rfc-level|' + header + ' at L4+ not in L3 POS section' + levelact + '}}'
                        elif not rfc and level == 3 + ety and header.startswith('Translation'):
                            rfctag = '{{rfc-level|' + header + ' at L3+' + levelact + '}}'
                        elif not rfc and level == 5 + ety and not inTrans and header.startswith('Translations to'):
                            rfctag = '{{rfc-level|' + header + ' at L5+, not in Translations' + levelact + '}}'

                    # blank line
                    newlines.append('')

                    # header + anything else that wasn't blank
                    newlines.append('======'[0:level] + header + '======'[0:level])
                    if rest.strip():
                        if not rest.startswith('{{rfc-'): acts.add('moved stuff after ' + header + ' header')
                        newlines.append(rest.strip())
                    # pointless, already stripped: elif len(rest) and not acts: acts.add('rm spaces after header')
                    if rfctag:
                        acts.add('+' + rfctag)
                        rfc = True
                        newlines.append(rfctag)
                        rfctag = ''

                    # set flags:
                    inext = defnext = False
                    if level < 4 + ety and (header in POS or header.startswith(HT)):
                        inext = inPos = True
                        npos += 1
                    elif level < 4 + ety: inPos = False
                    inTrans = (header == 'Translations')
                    tt = False
                    inPro = (header == 'Pronunciation')
                    continue

                # look for inflection line
                if inext:
                    if (line.startswith('{{') and line != '{{wikipedia}}') or line.startswith("''") or fuzzy(line, title) > len(title) - 1:
                        inext = False
                        defnext = True
                        #print "infl: %s" % safe(line)
                    #else:
                    #   print "not infl: %s" % safe(line)
                    if line and line[0] == '#':
                        acts.add('added inflection line for %s/%s' % (lang, header))
                        if lang in Ltocode:
                            newlines.append(infline(title, Ltocode[lang], header))
                        else:
                            newlines.append(infline(title, '', header))
                        defnext = True
                        inext = False
                        # and also do next case for defnext
                    # elide blanks above inflection line
                    if not line: continue

                # look for definition lines
                if defnext and line and line[0] == '#':
                    newlines.append('')
                    defnext = False
                # serious stuff ...
                if len(line) > 2 and line[0] == '#' and line[1] not in (' ', ':', '*', '#'):
                    line = '# ' + line[1:]
                if line[0:2] == '# ':

                    # look for context tag
                    if lang in Ltocode:
                        ctxn = 1
                        mo = recontext.match(line)
                        if not mo:
                            ctxn = 2
                            mo = recontext2.match(line)
                        if not mo:
                            ctxn = 3
                            mo = recontext3.match(line)
                        if mo:
                            print "match context tag %s" % safe(mo.group(1))
                            tname = cpar(mo.group(1), Contexts)
                        if mo and tname:
                            if lang != 'English': tname += '|lang=' + Ltocode[lang]
                            if contextp and ctxn == 1:
                                acts.add("-(''" + mo.group(1) + "'') +{{" + tname + "}}")
                                line = recontext.sub(r'# {{' + tname + r'}} \2', line)
                            elif contextp and ctxn == 2:
                                acts.add("-''(" + mo.group(1) + ")'' +{{" + tname + "}}")
                                line = recontext2.sub(r'# {{' + tname + r'}} \2', line)
                            elif contextp and ctxn == 3:
                                acts.add("-{{italbrac|" + mo.group(1) + "}} +{{" + tname + "}}")
                                line = recontext3.sub(r'# {{' + tname + r'}} \2', line)
                            else: print "would have replaced %s with %s" % (safe(mo.group(1)), safe(tname))
                            # elide cats that correspond
                            for catname in tname.split('|'):
                                if catname == 'context' or catname[0:5] == 'lang=': continue
                                catname = catname[0:1].upper() + catname[1:]
                                # code is prefix ...
                                if lang != 'English': catname = Ltocode[lang] + ':' + catname
                                if contextp: 
                                    catseen.add('[[Category:' + catname + ']]')
                                    # catseen.add('[[Category:' + catname + 's]]')
                                    print "added catseen %s" % safe(catname)

                    # wikilinking?
                """
                elif line[0:1] != '#' and not inTrans and line.find("''") >= 0:
                    # look for italbrac cases not on defn lines
                    newl = reibcomma.sub(ibsub, line)
                    newl = reibcomma2.sub(ibsub, newl)
                    if newl != line:
                        # acts.add('-' + line + ' +' + newl)
                        # acts.add('template i')
                        # in pronunciation, use a, anywhere else, we want i-c if at start of * line
                        if inPro:
                            newl = re.sub(r'\{\{(i|i-c)\|', '{{a|', newl)
                        else:
                            newl = re.sub(r'\{\{i\|', '{{i-c|', newl)
                        acts.add(sdif(line, newl))
                        line = newl
                    # think that will work?
                 """
                # translations lines

                # stopgap check:
                if line.find('{{ttbc|') >= 0: inTrans = False

                if inTrans:
                    # special indent rule, we know there is a previous line
                    if line.startswith(':') and newlines[-1:][0].startswith('*'):
                        acts.add('-: +*: in trans')
                        line = '*' + line
                    was = False
                    mo = retrans1.match(line)
                    if not mo: mo = retrans2.match(line)
                    if mo: was = True
                    if not mo: mo = retrans3.match(line)
                    if mo:
                        tlang = mo.group(1).strip()
                        if was and tlang.find('|') > 0: tlang = tlang.split('|')[1]
                        trest = mo.group(2).strip()

                        if tlang.startswith('{{') and tlang[2:-2] in Lcodes:
                            acts.add('subst %s in trans' % tlang)
                            tlang = Lcodes[tlang[2:-2]]
                            was = False

                        if was and (tlang in Top40 or title == tlang):
                            acts.add('trans unlink ' + tlang)
                        elif not was and tlang in Classics and title != tlang:
                            tlang = '[[' + tlang + ']]'
                            acts.add('trans link ' + tlang)
                        elif was:
                            # leave as is (was)
                            tlang = '[[' + tlang + ']]'

                        tr = regender.sub(r'{{\1}}', trest)
                        if tr != trest:
                            #acts.add('gender -' + trest + ' +' + tr)
                            acts.add('gender ' + sdif(trest, tr))
                            trest = tr

                        if trest: line = '*' + tlang + ': ' + trest
                        else: line = '*' + tlang + ':'

                    # convert templates
                    # has to be a non-blank previous line, we are in trans section

                    if line == '{{rfc-trans}}': inTrans = False
                    if line == '{{checktrans}}': inTrans = False
                    if line == '{{checktrans-top}}': inTrans = False
                    if line == '{{ttbc-top}}': inTrans = False

                    mo = retopgloss.match(line)
                    if mo:
                        if mo.group(1):
                            gloss = mo.group(1)[1:]
                        else:
                            prev = newlines[-1:][0]
                            while not prev:
                                newlines = newlines[:-1]
                                prev = newlines[-1:][0]
                            if prev.startswith(';'): gloss = prev[1:]
                            elif prev.startswith("'''") and prev.endswith("'''"): gloss = prev[3:-3]
                            else: gloss = ''
                            if gloss: newlines = newlines[:-1]
                        if gloss:
                            gloss = reglossfix.sub(r'\1', gloss).strip()
                            prev = line
                            line = '{{trans-top|' + gloss + '}}'
                            # <- else: line = '{{trans-top}}'
                            acts.add('-' + prev + ' +' + line)
                            tt = True
                    if tt and line == '{{mid}}':
                        line = '{{trans-mid}}'
                    if tt and line == '{{bottom}}':
                        newlines.append('{{trans-bottom}}')
                        # add blank line
                        line = ''
                        tt = False

                    # end of trans

                # all else
                newlines.append(line)

            # at end with no defn line in previous section?
            if defnext and header not in HAN:
                newlines.append('# {{defn|%s}}' % lang)
                acts.add('no definition line for %s/%s added {defn}' % (lang, header))

            # done with sect
            Lsect[lang] = newlines

        # reassemble ...

        newtext = ''
        prior = False

        for line in Lsect['*prolog']:
            # no blank lines
            if line: newtext += line + '\n'
        del Lsect['*prolog']
        blank = True

        for lang in sorted(Lsect, key=lkey):
            if lang == '*iwiki': continue
            if prior:
                if not blank: newtext += '\n'
                newtext += '----\n\n'
                divs -= 1
            prior = True
            if lang not in waslinked: newtext += '==' + lang + '==\n'
            else: newtext += '==[[' + lang + ']]==\n'
            blank = False
            for line in Lsect[lang]:
                # no dup blank lines
                if line or not blank: newtext += line + '\n'
                if line: blank = False
                else: blank = True
            if len(Lcats[lang]):
                if not blank: newtext += '\n'
                for cat in sorted(Lcats[lang]): newtext += cat + '\n'
                blank = False
            del Lsect[lang]

        if not blank: newtext += '\n'
        for line in Lsect['*iwiki']:
            # no blank lines
            if line: newtext += line + '\n'

        if divs != 0: acts.add("fixed ----'s")

        # rfc-level, etc trickery
        for rfname in ('level', 'xphrase', 'header'):
            if 'rm rfc-' + rfname + ' tag' in acts:
                for ac in sorted(acts):
                    if ac.startswith('+{{rfc-' + rfname):
                        acts.remove('rm rfc-' + rfname + ' tag')
                        acts.remove(ac)
                        print 'elided -' + rfname + ' +' + rfname
                        break

        """
        # if a new page, and no other actions, do minor spacing
        # (if rfc-tag trickery, this won't apply anyway, page is not new)
        if title in newpages and not acts and newtext.rstrip(' \n') != text.rstrip(' \n'):
            acts.add('minor spacing')
        """

        # if we added a major rfc, just do that, dump the rest of the work!!
        if rfcact:
            acts = set()
            acts.add(rfcact)
            newtext = text

        act = ''
        for ac in sorted(acts): act += ', ' + ac
        act = act.strip(', ')

        # some change, write it (even just rm tag)
        if act:

            fixed += 1

            print "format %s: %s" % (safe(title), safe(act))

            # try to fix the entry
            try:
                wikipedia.setAction(act)
                page.put(newtext)
            except wikipedia.PageNotSaved:
                print "failed to save page"
                # other action?
                continue
            except socket.timeout:
                print "socket timeout, maybe not saving page"
                continue
            except socket.error:
                print "socket error, maybe not saving page"
                continue

        # end loop

    print "entries fixed %d" % fixed

    # done

if __name__ == "__main__":
    try:
        main()
    finally:
        wikipedia.stopme()