User:MZMcBride/Magic

This page compiles some useful database queries, scripts, methods, and tools that are helpful when doing admin maintenance work.

Toolserver background

The Wikimedia Toolserver is a server with access to a (more or less) live copy of the database of all Wikimedia projects, for use with user scripts and utilities. Access is granted on request on a case-by-case basis.

There are three main servers: S1, S2, and S3. S1 holds the English Wikipedia; S2 holds Bulgarian Wikipedia and Wiktionary, Wikimedia Commons, Czech Wikipedia, German Wikipedia, English Wikiquote and Wiktionary, Esperanto Wikipedia, Finnish Wikipedia, Indonesian Wikipedia, Italian Wikipedia, Dutch Wikipedia, Norwegian Wikipedia, Polish Wikipedia, Portuguese Wikipedia, Swedish Wikipedia, Thai Wikipedia, Turkish Wikipedia, and Chinese Wikipedia; S3 holds every other public Wikimedia wiki not previously listed.

To run the queries below, you can sign up for a Toolserver account, use the Query service, or ask someone on their talk page or in the Toolserver IRC channel. For more information about the Toolserver, see http://wiki.ts.wikimedia.org/view/Main_Page and http://meta.wikimedia.org/wiki/Toolserver.

One important factor to watch when using the Toolserver is replication lag (replag). Because the Toolserver replicates the live Wikimedia databases, it sometimes falls behind (for whatever reason). If the replag is two weeks, any data obtained will be two weeks old. This can be problematic when dealing with orphaned talk pages and the like, as they may have already been de-orphaned or deleted. To view the Toolserver's current replag, see here.
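
If you have a Toolserver account, the queries on this page can also be run from Python rather than the mysql command line. Below is a minimal sketch, not a ready-made tool: the enwiki-p.db.toolserver.org hostname and the ~/.my.cnf credentials file are assumptions based on the usual Toolserver setup, and the replag figure is only a rough estimate obtained by comparing the newest replicated recent change against the current time.

import os
import MySQLdb

# Connect to the English Wikipedia replica (hostname and credentials file
# follow the usual Toolserver conventions; adjust for your own setup).
conn = MySQLdb.connect(
    host='enwiki-p.db.toolserver.org',
    db='enwiki_p',
    read_default_file=os.path.expanduser('~/.my.cnf'))
cursor = conn.cursor()

# Rough replag estimate: seconds between the newest replicated
# recent change and the current time on the database server.
cursor.execute("""
    SELECT TIMESTAMPDIFF(SECOND, MAX(rc_timestamp), UTC_TIMESTAMP())
    FROM recentchanges""")
print "Approximate replag: %s seconds" % cursor.fetchone()[0]

# Any of the queries on this page can be run the same way, e.g.:
cursor.execute("SELECT page_title FROM page WHERE page_namespace = 2 LIMIT 5")
for (title,) in cursor.fetchall():
    print title

cursor.close()
conn.close()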

General links

CSD G2 and CSD G7

This is good for finding test pages (CSD G2) and blanked pages (CSD G7) as well as other quirky pages.

SELECT CONCAT(ns_name, ':', page_title, ' - ', page_len)
FROM page
JOIN toolserver.namespace ON page_namespace = ns_id AND dbname = 'enwiki_p'
LEFT JOIN templatelinks ON tl_from = page_id
WHERE page_len < 50
AND page_is_redirect = 0
AND page_namespace NOT IN (2, 3)
AND tl_from IS NULL
AND (SELECT COUNT(DISTINCT rev_user_text) FROM revision WHERE rev_page = page_id) = 1
ORDER BY page_len ASC;

Code courtesy of nl:User:Erwin and de:User:Duesentrieb.

Explanation: All pages that are not in the User: or User_talk: namespaces, are less than 50 bytes in length, have only one contributor, use no templates, and are not redirects.

CSD U2

Generate three lists: one of all users on en.wiki, one of all pages in User:, and one of all pages in User_talk:. Run dpn.py twice: once comparing the user list against the User: pages, and once against the User_talk: pages. This produces lists of pages in User: or User_talk: that do not correspond to a registered user (CSD U2).

SELECT user_name FROM user;
SELECT page_title FROM page WHERE page_namespace = 2;
SELECT page_title FROM page WHERE page_namespace = 3;
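
The comparison that dpn.py (shown later on this page) performs can also be sketched directly for this case. The filenames below (users.txt, userpages.txt, usertalkpages.txt) are just example names for the three query results saved one title per line; note that user_name is stored with spaces while page_title uses underscores, so one side has to be normalized before comparing.

import codecs

def read_list(path):
    # One title per line, as produced by the queries above.
    f = codecs.open(path, 'r', 'utf-8')
    titles = [line.strip() for line in f if line.strip()]
    f.close()
    return titles

# Normalize user names to underscore form so they match page titles.
users = set(name.replace(' ', '_') for name in read_list('users.txt'))

for filename, ns in (('userpages.txt', 'User'), ('usertalkpages.txt', 'User_talk')):
    for title in read_list(filename):
        base = title.split('/')[0]  # subpages belong to their base user page
        if base not in users:
            print ('# [[%s:%s]]' % (ns, title)).encode('utf-8')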

CSD G8

SELECT CONCAT('[[Talk:', p1.page_title, ']]')
FROM page p1
WHERE p1.page_namespace = 1
AND NOT EXISTS (SELECT * FROM page p2 WHERE p2.page_namespace = 0 AND p1.page_title = p2.page_title)
AND p1.page_title NOT LIKE "%/%";

Explanation: Finds all pages in the Talk: namespace whose titles don't contain a "/" and that do not have a corresponding subject-space page (CSD G8).

  • Find forgotten pages (a rough sketch of these steps follows the list)
    • SELECT page_title FROM page WHERE page_namespace = 0
      
    • SELECT page_title FROM page WHERE page_namespace = 1
      
    • Run "/.*$" on first list.
    • Run both lists through dpn.py.
    • Filter page titles that are redirects, don't exist, or contain "archive".
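
A rough sketch of the steps above, assuming the goal is to surface Talk: pages (including subpages such as archives) whose corresponding article no longer exists. The filenames articles.txt and talkpages.txt are hypothetical names for the two query results, and the final redirect/existence filtering still has to be done afterwards, by hand or via the API.

import re

# First list: article titles (namespace 0), reduced with the "/.*$" regex
# so that each entry is just the part before the first slash.
articles = set(re.sub(r'/.*$', '', line.strip())
               for line in open('articles.txt') if line.strip())

# Second list: talk page titles (namespace 1). Keep those whose base title
# has no matching article, skipping anything that looks like an archive
# ('rchive' matches both capitalizations).
for line in open('talkpages.txt'):
    title = line.strip()
    if not title:
        continue
    base = title.split('/')[0]
    if base not in articles and 'rchive' not in title:
        print '# [[Talk:%s]]' % title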

CSD R1

SELECT CONCAT('[[:',ns_name,':',p1.page_title,']] &rarr; [[:',ns_name,':',rd_title,']]')
FROM redirect AS rd JOIN page p1 ON (rd.rd_from=p1.page_id)
JOIN toolserver.namespace ON page_namespace = ns_id AND dbname = 'enwiki_p'
LEFT JOIN page AS p2 ON (rd_namespace=p2.page_namespace AND rd_title=p2.page_title )
WHERE rd_namespace >= 0
AND p2.page_namespace IS NULL
ORDER BY p1.page_namespace ASC;

Code courtesy of MediaWiki.

SELECT
p1.page_namespace AS namespace,
p1.page_title     AS title,
pl_namespace,
pl_title
FROM pagelinks AS pl
JOIN page p1 ON (p1.page_is_redirect=1 AND pl.pl_from=p1.page_id)
LEFT JOIN page AS p2 ON (pl_namespace=p2.page_namespace AND pl_title=p2.page_title )
WHERE p2.page_namespace IS NULL;

Code courtesy of MediaWiki.

Explanation: The first query just finds broken redirects and lists their target. The second query is good at finding pages where the redirect has text below it that should be moved to its target.

CSD C1

SELECT CONCAT('# [[:Category:', page_title, ']]')
FROM categorylinks RIGHT OUTER JOIN
(SELECT page_id, page_title FROM page
WHERE page_namespace = 14) pagetmp
ON cl_to = pagetmp.page_title
WHERE ISNULL(cl_to)
AND NOT EXISTS (SELECT 1 FROM categorylinks
WHERE cl_from = page_id AND cl_to = 'Wikipedia_category_redirects')
AND NOT EXISTS (SELECT 1 FROM categorylinks
WHERE cl_from = page_id AND cl_to = 'Disambiguation_categories');

Code courtesy of User:CBM.

Explanation: Lists empty categories that are not members of Category:Wikipedia category redirects or Category:Disambiguation categories. After generating the list, remove all titles that match "(-importance|-class|non-article|assess)".
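
A sketch of that post-filter, assuming the query output was saved one line per category to a file (categories.txt is a hypothetical name):

import re

# Assessment and maintenance categories are empty by design, so drop them.
skip = re.compile(r'(-importance|-class|non-article|assess)')
for line in open('categories.txt'):
    if line.strip() and not skip.search(line):
        print line.rstrip()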

CSD T3

SELECT CONCAT("# [[Template:", page_title, "]]")
FROM page
LEFT JOIN templatelinks
ON page_namespace = tl_namespace AND page_title = tl_title
WHERE page_namespace = 10 AND tl_from IS NULL;

Code courtesy of MediaWiki and de:User:Duesentrieb.

Explanation: Lists all pages in the Template: namespace that are not transcluded on any page.

Notes: Stub templates and templates that are called from other templates dynamically should not be deleted.

CSD G6

SELECT page_namespace, page_id, count(*), max(rev_timestamp)
FROM page
JOIN revision ON page_id = rev_page
LEFT JOIN templatelinks ON page_id = tl_from
WHERE page_len <= 10
AND rev_timestamp < '20080000000000'
AND tl_from is null
GROUP BY page_id HAVING COUNT(*) < 3;

Code courtesy of User:Autocracy.

Explanation: Lists all pages that have no edits in 2008 (or later), are 10 bytes or smaller, transclude no templates, and have only one or two edits.

CSD R2

SELECT CONCAT('[[',pf.page_title,']] -- [[:{{ns:',pt.page_namespace,'}}:',rd_title,']]')
FROM redirect, page as pf, page as pt
WHERE pf.page_namespace=0
AND rd_title = pt.page_title
AND rd_namespace=pt.page_namespace
AND (pt.page_namespace != 0)
AND rd_from=pf.page_id
AND pf.page_namespace=0;

Code courtesy of User:ST47.

Explanation: Lists all redirects from the (Main) namespace to a non-(Main) namespace.

Notes: Many of the redirects listed are very old and are kept intentionally.

SELECT CONCAT('[[',pf.page_title,']] -- [[:{{ns:',pt.page_namespace,'}}:',rd_title,']]')
FROM redirect, page as pf, page as pt
WHERE pf.page_namespace=0
AND rd_title=pt.page_title
AND rd_namespace=pt.page_namespace
AND (pt.page_namespace=1 OR pt.page_namespace=2 OR pt.page_namespace=3)
AND rd_from=pf.page_id
AND pf.page_namespace=0;

Code courtesy of User:ST47.

Explanation: Lists all redirects from the (Main) namespace to Talk:, User:, or User_talk: (CSD R2).

Inaccurate file extensions

SELECT CONCAT(img_name, ' ', img_major_mime, ' ', img_minor_mime)
FROM image;

Code courtesy of User:CBM.

Explanation: Lists all images, their major MIME type, and their minor MIME type.

Then filter the list down to files whose extension does not match their MIME type.
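
A sketch of that filter, assuming the query output was saved to a file (images.txt is a hypothetical name) with the name, major MIME type, and minor MIME type separated by spaces as in the CONCAT above. The extension-to-MIME table here is deliberately small and illustrative, not exhaustive, and the output format is just an example.

# Map common file extensions to the minor MIME type they should have.
# Illustrative only; extend as needed.
EXPECTED = {
    'jpg': 'jpeg', 'jpeg': 'jpeg',
    'png': 'png',
    'gif': 'gif',
    'svg': 'svg+xml',
    'tif': 'tiff', 'tiff': 'tiff',
    'pdf': 'pdf',
    'ogg': 'ogg',
}

for line in open('images.txt'):
    if not line.strip():
        continue
    name, major, minor = line.strip().rsplit(' ', 2)
    ext = name.rsplit('.', 1)[-1].lower()
    if ext in EXPECTED and EXPECTED[ext] != minor.lower():
        print '# [[:Image:%s]] -- %s/%s' % (name, major, minor)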

Orphaned subpages

Subpages without any incoming links generally indicate an issue. If the subpage is an archive, it should at least be linked from the active talk page.

SELECT CONCAT('Talk:', page_title)
FROM page
LEFT JOIN pagelinks
ON page_namespace=pl_namespace AND page_title=pl_title
WHERE pl_namespace IS NULL
AND page_namespace = 1;

Code courtesy of MediaWiki.

Explanation: This generates a list of all pages in the Talk: namespace that have no incoming links.

Category:Temporary Wikipedian userpages

# Copyright (c) 2008 Betacommand

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.

import wikipedia
import catlib
import pagegenerators
import simplejson
import re
import codecs
def main():
  site = wikipedia.getSite()
  f = codecs.open('file_name_in.txt','r', 'utf-8')
  names = f.read().split('\n')[:-1]
  f.close()
  gen = []
  g=0
  f2 = codecs.open('file_name_out.txt', 'a', 'utf-8')
  for name in names:
    page3 = wikipedia.Page(site,'%s' % name)
    gen.append(page3)
  for page in pagegenerators.PreloadingGenerator(gen, pageNumber = 500):
    try:
      text = page.get()
      stamp = page.editTime()
      if not re.search('(sock|Sock|checkuser|puppetmaster|Doppleganger|doppleganger|Doppelganger|doppelganger|VaughanWatch|\{\{bot|Wikipedia:Doppelganger_account|User:Jimbo Wales|WP:AC|Banned user|Banned Wikipedia|User Alternate Acct|\{\{banned)',text,re.I) and not stamp.startswith("200807") and not stamp.startswith("200808"):
        try:
          g +=1
          f2.write('# [['+ page.title() + ']]\n')
          wikipedia.output(u'%s. %s' % (g, page.title()))
        except UnicodeEncodeError:
          print "ERROR"
    except:
      print "ERROR"
  f2.close()

if __name__ == '__main__':
    try:
        main()
    finally:
        wikipedia.stopme()

Pages needing archiving

SELECT CONCAT(ns_name, ':', page_title, ' - ', page_len)
FROM page
JOIN toolserver.namespace ON page_namespace = ns_id AND dbname = 'enwiki_p'
WHERE page_len > 100000
AND page_namespace NOT IN (0, 2, 3, 4, 100)
AND page_title NOT LIKE "%rchive%";

Imageless pages

SELECT CONCAT("# [[",page_title,"]]")
FROM page
LEFT JOIN imagelinks ON page_id = il_from
WHERE page_namespace = 0
AND page_is_redirect = 0
AND imagelinks.il_to IS NULL;

SELECT
  CONCAT("# [[",page_title,"]]")
FROM page
LEFT JOIN imagelinks ON page_id = il_from
WHERE page_namespace = 0
AND page_is_redirect = 0
AND imagelinks.il_to IS NULL
AND NOT EXISTS (SELECT
                  1
                FROM pagelinks
                WHERE pl_from = page_id
                AND page_namespace NOT IN (1, 4, 5, 6, 7, 8, 9, 10, 11, 14));

Duplicate files

SELECT CONCAT(img_name,' -- ',img_sha1) FROM image;

Then filter the list down to hashes that appear more than once.
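
A sketch of that duplicate check, assuming the query output was saved to a file (hashes.txt is a hypothetical name) in the "name -- sha1" format produced by the CONCAT above; it groups names by hash and prints every hash shared by more than one file.

from collections import defaultdict

by_hash = defaultdict(list)
for line in open('hashes.txt'):
    line = line.strip()
    if not line:
        continue
    name, sha1 = line.rsplit(' -- ', 1)
    by_hash[sha1].append(name)

for sha1, names in by_hash.iteritems():
    if len(names) > 1:
        print '# ' + ' -- '.join('[[:Image:%s]]' % n for n in names)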

Inappropriate cross-namespace links

SELECT page_title FROM page WHERE page_namespace = 0 AND page_is_redirect = 0;

Explanation: Generates a list of all pages in the (Main) namespace that are not redirects.

# Copyright 2008 bjweeks

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

import API

import Wiki

wiki = Wiki.Wiki()
wiki.login('Username', 'Password')
output_file = open('output.txt', 'a')
    
def link_check(page_list):
    print 'Running query on set starting with %s' % page_list[0]
    params = {
        'action': 'query',
        'prop': 'links',
        'titles': '|'.join([unicode(page.replace('_', ' '), 'utf-8').encode('ascii', 'xmlcharrefreplace') for page in page_list[:500]]),
        'plnamespace': '2|3|7|8|9|13|15'
    }
    request = API.APIRequest(wiki, params)
    query = request.query(querycontinue=False)
    pages = query['query']['pages']
    for k,v in pages.iteritems():
        if 'links' in v:
            print v['title']
            output_file.write(v['title'].encode('ascii', 'xmlcharrefreplace') + '\n')
            output_file.flush()
    return page_list[500:]  # everything after the batch just queried

f = open('all_pages.txt')
all_pages = f.read().split('\n')
f.close()

while all_pages:
    all_pages = link_check(all_pages)
output_file.close()

Explanation: This script fills the void left by the disabling of Special:CrossNamespaceLinks. Using the query above saved to a file called "all_pages.txt", it goes through each non-redirect in the (Main) namespace and determines if there are links to the User:, User_talk:, Image_talk:, MediaWiki:, MediaWiki_talk:, Help_talk:, or Category_talk: namespaces. If it finds such links, it lists the page in a file called "output.txt". To run this script, replace "Username" and "Password" with the appropriate data.

Note: This script requires the API.py and Wiki.py modules. It also requires an account with the 'apihighlimits' right (usually the sysop and bot groups).

Users by edit count

SET @counter:=0; SELECT @counter:=@counter+1 AS rank, user_name, user_editcount
FROM user
WHERE user_id NOT IN (SELECT ug_user FROM user_groups WHERE ug_group = 'bot')
ORDER BY user_editcount DESC LIMIT 4000;

Explanation: This query will list the first 4,000 users not currently in the 'bot' user group by descending edit count.

Users with edits only in the User: namespace

SELECT
  CONCAT("# [[User:",user_name,"]] -- ",
  page_len)
FROM user
JOIN page
ON user_name = REPLACE(page_title, '_', ' ')
AND page_namespace = 2
WHERE user_editcount = 0
AND NOT EXISTS
  (SELECT
     1
   FROM revision
   JOIN page
   ON rev_page = page_id
   WHERE rev_user_text = user_name AND NOT page_namespace in (2));

Explanation: This query will list all users who have (non-deleted) contributions only in the User: namespace.

Miscellaneous Python scripts

Some of these Python scripts require the pywikipedia framework.

Incoming links

#!/usr/bin/env python2.5

# Copyright 2008 bjweeks

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

import datetime
import httplib

import simplejson

def request(url):
        conn = httplib.HTTPConnection('en.wikipedia.org')
        conn.request('GET', url)
        response = conn.getresponse()
        data = response.read()
        conn.close()
        return data
            
def api(**kwargs):
        data = request('/w/api.php?action=query&format=json&%s' % '&'.join(['%s=%s' % (k, v) for k,v in kwargs.iteritems()]))
        if data:
            query = simplejson.loads(data)['query']
            return query
        else:
            return api(**kwargs)
            
def nobacklinks(template):
    # Despite the name, this returns True when the page *has* at least one backlink.
    query = api(list='backlinks', bltitle=template, bllimit=1)
    if query['backlinks']:
        return True
    return False
                
input_file = open('file_name_in.txt', 'r')
input_list = [line.strip() for line in input_file.readlines()]
input_file.close()
output_file = open('file_name_out.txt', 'a')
try:
    for template in input_list[:]:  # iterate over a copy so items can be removed safely
        if not nobacklinks(template):
            print '[[%s]] is completely orphaned.' % template
            output_file.write(template+'\n')
        input_list.remove(template)
except:
    raise
finally:
    input_file = open('file_name_in.txt', 'w')
    input_file.write('\n'.join(input_list))
    input_file.close()
    output_file.close()

Explanation: Using an input list named "file_name_in.txt" with "FullPageName" on each line, this script checks if the page has any incoming links. If the page has no incoming links, the page is added to a file named "file_name_out.txt".

Open proxies

# Copyright (c) 2008 Betacommand

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.

import sys, wikipedia, userlib

if __name__ == '__main__':
  try:
    site = wikipedia.getSite()
    f = open(sys.argv[1],'r')
    names = f.read().split('\n')[:-1]
    for name in names:
      user = userlib.User(site,name)
      try:
        print "Blocking [[User:%s]]..." % name
        user.block(expiry='2 years', reason='{{blockedproxy}}', anonOnly=False, noSignup=True, enableAutoblock=True)
      except:
        pass
  finally:
    wikipedia.stopme()

Explanation: This script requires an input file with an IP address on each line. It blocks each IP for two years using the block reason {{blockedproxy}}.

dpn.py

import os
import shelve

# On-disk shelf used as the lookup table; the backing file is unlinked immediately.
store = shelve.open("storefile")
os.remove('storefile')
mismatches = []
for line in open("file_name_a.txt","r"):
    store[line.rstrip()] = None
for line in open("file_name_b.txt","r"):
    if line.rstrip().split("/")[0] not in store:
        mismatches.append(line)
for mismatch in mismatches:
    print mismatch.rstrip()
store.close()

Code courtesy of dpn`` in #python.

Explanation: This script takes two lists. The first input file contains a list where no entry has a "/". The second input file has a list of similar entries, but some will contain a "/". The script truncates everything following a "/" in the second input file, and compares it to the first input file. It outputs all mismatches. For example:

  • Input 1: Foo, Bar, Baz, Bat
  • Input 2: Foo/a, Bar/a/b, Baz, Ping
  • Output: Ping

This script is incredibly helpful for finding orphaned subpages (CSD G8 or CSD G6) and pages that belong to users who do not exist (CSD G2).

API.py

# Copyright 2008 Mr.Z-man

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

# -*- coding: utf-8 -*-
import urllib2, simplejson, re, time, cookielib
from urllib import urlencode

class APIError(Exception):
	"""Base class for errors"""

class APIRequest:
	"""
	A request to the site's API
	wiki - A Wiki object
	data - API parameters in the form of a dict
	maxlag is set by default to 5 but can be changed
	format is always set to json
	"""
	def __init__(self, wiki, data):
		self.sleep = 5
		self.data = data
		self.data['format'] = "json"
		if not data.has_key('maxlag'):
			self.data['maxlag'] = "5"
		self.encodeddata = urlencode(self.data)
		self.headers = {
			"Content-type": "application/x-www-form-urlencoded",
			"User-agent": "MediaWiki-API-python/0.1",
			"Content-length": len(self.encodeddata)
		}
		self.wiki = wiki
		self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(wiki.cookies))
		self.request = urllib2.Request(wiki.apibase, self.encodeddata, self.headers)

	def query(self, querycontinue=True):
		"""
		Actually do the query here and return usable stuff
		"""
		
		data = False
		while not data:
			rawdata = self.__getRaw()
			data = self.__parseJSON(rawdata)
		#Certain errors should probably be handled here...
		if data.has_key('error'):
			raise APIError(data['error']['code'], data['error']['info'])
		if data.has_key('query-continue') and querycontinue:
			data = self.__longQuery(data)
		return data
	
	def __longQuery(self, initialdata):
		"""
		For queries that require multiple requests
		FIXME - queries can have multiple continue things....
		http://en.wikipedia.org/w/api.php?action=query&prop=langlinks|links&titles=Main%20Page&redirects&format=jsonfm
		"""
	
		totaldata = [initialdata]
		key1 = initialdata['query-continue'].keys()[0]
		key2 = initialdata['query-continue'][key1].keys()[0]
		if isinstance(initialdata['query-continue'][key1][key2], int):
			querycont = initialdata['query-continue'][key1][key2]
		else:
			querycont = initialdata['query-continue'][key1][key2].encode('utf-8')
		while querycont:
			self.data[key2] = querycont
			self.encodeddata = urlencode(self.data)
			self.headers['Content-length'] = len(self.encodeddata)
			self.request = urllib2.Request(self.wiki.apibase, self.encodeddata, self.headers)
			newdata = self.query(False)
			totaldata.append(newdata)
			if newdata.has_key('query-continue') and newdata['query-continue'].has_key(key1):
				querycont = newdata['query-continue'][key1][key2]
			else:
				querycont = False
		return totaldata
					
	def __getRaw(self):
		data = False
		while not data:
			try:
				data = self.opener.open(self.request)
			except:
				if self.sleep == 60:
					print("Aborting")
					return
				else:
					print("Server error, trying request again in "+str(self.sleep)+" seconds")
					time.sleep(self.sleep+0.5)
					self.sleep+=5
		self.response = data.info()
		return data

	def __parseJSON(self, data):
		maxlag = True
		while  maxlag:
			try:
				maxlag = False
				response = simplejson.loads(data.read())
				if response.has_key('error'):
					error = response['error']['code']
					if error == "maxlag":
						lagtime = re.search("(\d+) seconds", response['error']['info']).group(1)
						print("Server lag, sleeping for "+lagtime+" seconds")
						maxlag = True
						time.sleep(int(lagtime)+0.5)
			except: # Something's wrong with the data....
				return False
		return response

	def setUserAgent(self, useragent):
		"""
		Function to set a different user-agent
		"""
		self.headers['User-agent'] = useragent

Wiki.py

# Copyright 2008 Mr.Z-man

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

# -*- coding: utf-8 -*-
import cookielib, API

class WikiError(Exception):
	"""Base class for errors"""

class BadTitle(WikiError):
	"""Invalid title"""
	
class NoPage(WikiError):
	"""Non-existent page"""

class EditError(WikiError):
	"""Problem with edit request"""

class Wiki:
	"""
	A Wiki site
	url - A URL to the site's API, defaults to en.wikipedia
	"""	
	def __init__(self, url="http://en.wikipedia.org/w/api.php"):
		self.apibase = url
		self.cookies = cookielib.CookieJar()
		self.username = ''
	
	def login(self, username, password = False, remember = True):
		"""
		Login to the site
		remember - currently unused
		"""
		
		if not password:
			from getpass import getpass
			password = getpass()
		data = {
			"action" : "login",
			"lgname" : username,
			"lgpassword" : password
		}
		req = API.APIRequest(self, data)
		info = req.query()
		if info['login']['result'] == "Success":
			self.username = username
		else:
			try:
				print info['login']['result']
			except:
				print info['error']['code']
				print info['error']['info']
	
	def isLoggedIn(self, username = False):
		"""
		Verify that we are a logged in user
		username - specify a username to check against
		"""
		
		data = {
			"action" : "query",
			"meta" : "userinfo",
		}
		req = API.APIRequest(self, data)
		info = req.query()
		if info['query']['userinfo']['id'] == 0:
			return False
		elif username and info['query']['userinfo']['name'] != username:
			return False
		else:
			return True
		
class Page:
	""" A page on the wiki
	wiki - A wiki object
	title - The page title, as a string
	check - Checks for existence, normalizes title
	followRedir - follow redirects (check must be true)
	"""	
	def __init__(self, wiki, title, check=True, followRedir = True):
		self.limit = '5000' #  FIXME:There needs to be a way to set this based on userrights
		self.wiki = wiki
		self.title = title.encode('utf-8')
		self.wikitext = ''
		self.templates = ''
		self.pageid = 0 # The API will set a negative pageid for bad titles
		self.exists = True # If we're not going to check, assume it does
		if check:
			self.setPageInfo(followRedir)


	def setPageInfo(self, followRedir=True):
		"""
		Sets basic page info, required for almost everything
		"""
		
		params = {
			'action': 'query',
			'titles': self.title,
			'indexpageids':'1'
		}
		if followRedir:
			params['redirects'] = '1'
		req = API.APIRequest(self.wiki, params)
		response = req.query()
		if response['query'].has_key('normalized'):
			self.title = response['query']['normalized'][0]['to'].encode('utf-8')
		if followRedir and response['query'].has_key('redirects'):
			self.title = response['query']['redirects'][0]['to'].encode('utf-8')
		self.pageid = response['query']['pageids'][0]
		if not self.title:
			self.title = response['query']['pages'][self.pageid]['title'].encode('utf-8')
		if response['query']['pages'][self.pageid].has_key('missing'):
			self.exists = False
		if response['query']['pages'][self.pageid].has_key('invalid'):
			raise BadTitle(self.title)
		if response['query']['pages'][self.pageid].has_key('ns'):
			self.namespace = response['query']['pages'][self.pageid]['ns']
			
	def getWikiText(self, expandtemplates=False, force=False):
		"""
		Gets the Wikitext of the page
		expandtemplates - expand the templates to wikitext instead of transclusions
		force - load the text even if we already loaded it before
		"""
	
		if self.wikitext and not force:
			return self.wikitext
		if self.pageid == 0:
			self.setPageInfo(followRedir=False)
		if not self.exists:
			return self.wikitext
		params = {
			'action': 'query',
			'prop': 'revisions',
			'rvprop': 'content',
			'pageids': self.pageid,
			'rvlimit': '1'
		}
		if expandtemplates:
			params['rvexpandtemplates'] = '1'
		req = API.APIRequest(self.wiki, params)
		response = req.query(False)
		self.wikitext = response['query']['pages'][self.pageid]['revisions'][0]['*'].encode('utf-8')
		return self.wikitext

	def getTemplates(self, force=False):
		"""
		Gets all list of all the templates on the page
		force - load the list even if we already loaded it before
		"""
	
		if self.templates and not force:
			return self.templates
		if self.pageid == 0:
			self.setPageInfo()
		if not self.exists:
			raise NoPage
		params = {
			'action': 'query',
			'prop': 'templates',
			'pageids': self.pageid,
			'tllimit': self.limit,
		}
		req = API.APIRequest(self.wiki, params)
		response = req.query()
		self.templates = []
		if isinstance(response, list): #There shouldn't be more than 5000 templates on a page...
			for page in response:
				self.templates.extend(self.__extractTemplates(page))
		else:
			self.templates = self.__extractTemplates(response)
		return self.templates
	
	def __extractTemplates(self, json):
		list = []
		for template in json['query']['pages'][self.pageid]['templates']:
			list.append(template['title'].encode('utf-8'))
		return list
	
	def edit(self, newtext=False, prependtext=False, appendtext=False, summary=False, section=False, minor=False, bot=False, basetime=False, recreate=False, createonly=False, nocreate=False, watch=False, unwatch=False):
		"""
		Edit the page
		Most params are self-explanatory
		basetime - set this to the time you loaded the pagetext to avoid
		overwriting other people's edits in edit conflicts
		"""
	
		if not newtext and not prependtext and not appendtext:
			raise EditError
		if prependtext and section:
			raise EditError
		if createonly and nocreate:
			raise EditError
		token = self.getToken('edit')
		from hashlib import md5
		if newtext:
			hashtext = newtext
		elif prependtext and appendtext:
			hashtext = prependtext+appendtext
		elif prependtext:
			hashtext = prependtext
		else:
			hashtext = appendtext
		params = {
			'action': 'edit',
			'title':self.title,
			'token':token,
			'md5':md5(hashtext).hexdigest(),
		}
		if newtext:
			params['text'] = newtext.encode('utf-8')
		if prependtext:
			params['prependtext'] = prependtext.encode('utf-8')
		if appendtext:
			params['appendtext'] = appendtext.encode('utf-8')
		if summary:
			params['summary'] = summary.encode('utf-8')
		if section:
			params['section'] = section.encode('utf-8')
		if minor:
			params['minor'] = '1'
		else:
			params['notminor'] = '1'
		if bot:
			params['bot'] = '1'
		if basetime:
			params['basetimestamp'] = basetime.encode('utf-8')
		if recreate:
			params['recreate'] = '1'
		if createonly:
			params['createonly'] = '1'
		if nocreate:
			params['nocreate'] = '1'
		if watch:
			params['watch'] = '1'
		if unwatch:
			params['unwatch'] = '1'
		req = API.APIRequest(self.wiki, params)
		result = req.query()
		return result		
	
	def getToken(self, type):
		""" 
		Get a token for everything except blocks and rollbacks
		type (String) - edit, delete, protect, move, block, unblock, email
		Currently all the tokens are interchangeable, but this may change in the future
		"""
			
		if self.pageid == 0:
			self.setPageInfo()
		if not self.exists:
			raise NoPage
		params = {
			'action':'query',
			'pageids':self.pageid,
			'prop':'info',
			'intoken':type,
		}
		req = API.APIRequest(self.wiki, params)
		response = req.query()
		token = response['query']['pages'][self.pageid][type+'token']
		return token

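A minimal usage sketch tying the two modules together, based only on the classes defined above (the page title is just an example; logging in is only needed for edits or high API limits):

import Wiki

wiki = Wiki.Wiki()                    # defaults to the English Wikipedia API
# wiki.login('Username', 'Password')  # uncomment if the task needs a logged-in account

page = Wiki.Page(wiki, u'Wikipedia:Sandbox')
if page.exists:
    print page.title
    print page.getWikiText()[:200]    # first 200 bytes of the wikitext
    print page.getTemplates()
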
To-do

  • pywikipedia
  • CAT:TEMP header (with stuff above)
  • Pages and subpages of indefinitely blocked users (User: and User_talk:)
    • Caveat: Users blocked indefinitely for dying
    • Userboxes that are still actively transcluded
  • image hash duplicates (local and vs. Commons)

Discoveries

In doing maintenance work on a database this large, pages are occasionally missed by RecentChanges patrollers and others. The following is a list of pages that stuck around even though they should have been deleted immediately:

Typos in the main article on the subject: