User:Graham87/Import

I often import old edits that are not in the English Wikipedia database from older versions of Wikipedia to restore missing edits. In almost all cases this affects only the page history, not the page content, but see principle #10 below. This page describes how I handle certain situations when importing this old history.

Types of import

There are two methods of importing: via another wiki (transwiki import) or via an XML file on a user's computer (upload import). The former is available to all administrators on the English Wikipedia per bug 20280, while the latter is generally restricted to stewards because it can very easily be used to falsify page histories. I have been granted the right to use upload imports per this discussion; among other things, this right allows me to import edits from the 2002 and 2003 database dumps (see below).

Nostalgia Wikipedia

The Nostalgia Wikipedia is a copy of the Wikipedia database from 20 December 2001, when Wikipedia used UseModWiki rather than MediaWiki. In the following text, "nost.wp" means the Nostalgia Wikipedia and "en.wp" means the English Wikipedia.

Principles

  1. When an edit has exactly the same timestamp and username at en.wp and nost.wp, it will not be imported. For example, the software only imported 8 revisions to "transport", even though the history at nost.wp contains 12 edits. An exception to this rule is when the username of the editor contains an underline or an initial lower-case letter (see point 3). Note that all timestamps on the Nostalgia Wikipedia are in UTC.
  2. In transwiki importing, the name of a page cannot be changed during importation, but a page can be imported to any arbitrary namespace. Almost all edits in the Nostalgia Wikipedia are in the main namespace, because namespaces did not exist in 2001. See this log entry for an import of a user page. In upload importing, the name of a page can be changed by editing the XML file.
  3. The usernames of some users from 2001, such as Larry_Sanger and office.bomis.com, contained underlines or initial lower-case letters, which are still present in both the en.wp and nost.wp databases (see bug 323). Edits with usernames that contain underlines or begin with a lower-case letter cannot be accessed using Special:Contributions. When such edits are imported, any underlines in usernames are converted to spaces and initial lower-case letters are converted to upper-case letters, which may cause duplicate edits to appear in the page history.
  4. Talk pages were usually at the title "Pagename/Talk" in December 2001, rather than "Talk:Pagename" as they are now. See point 2. Some talk pages were at the title "Pagename/talk"; these were swallowed up during the conversion to the MySQL software. I have imported all the relevant content and history in those pages from the Nostalgia Wikipedia.
  5. Most subpages had titles in the form "Subpage/Test", where the part of the title after the slash began with an upper-case letter. However, if it began with a lower-case letter, like "Subpage/test", its UseModWiki edits weren't imported during the mass-import of these old edits in September 2002. I have imported all the relevant edits from the Nostalgia Wikipedia.
  6. The last edit before the conversion from UseModWiki to Phase II software, a forerunner of MediaWiki, does not appear in the current English Wikipedia database, and Conversion script appears to perform that edit instead (see this discussion about the topic). When such an edit appears in the Nostalgia Wikipedia but not the English Wikipedia, it can be imported (see the import log of "hour").
  7. When an edit is imported, it gets a new revision ID, higher than most edits in the page history. Before MediaWiki 1.18, when checking diffs, the part of MediaWiki that calculates the number of intermediate revisions worked using revision IDs, not dates, so the number of intermediate revisions was reported incorrectly (as noted in bug 2930). This diff at "Dundee" was a good example. As described at bug 2930, navigation using the previous/next edit feature is still affected, since that feature finds the previous/next edit by revision ID, not by date.
  8. Diffs between imported and non-imported edits may show additions or removals of line breaks; for an example, see this diff at "river".
  9. If an edit is imported where the username is not registered, the username will be recorded with an ID of 0, like an IP address. Compare this edit at "God" and this edit at "Lower peninsula of Michigan". It is generally a good idea to register these ancient account names to prevent impersonation. Also, when Wikipedia used UseModWiki, IP addresses were recorded with the last octet replaced by "xxx", like "127.0.0.xxx" instead of "127.0.0.1" (see bug 3631).
  10. When a revision is imported over an existing page, the import usually has no effect on the page's content. However, when the latest edit to the original page occurred before the latest edit to the import source (in this case the Nostalgia Wikipedia), the content of the original page is replaced with the content of the latest imported edit. See Wikipedia talk:Historical archive/Wikipedia teamwork for an example of this situation.
  11. The change in number of bytes for imported edits in the page history is incorrect. See an example at "Bob Jones University" and bug 36976.
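
One way to picture how principles 1 and 3 interact is the toy model below. It is a sketch of the behaviour described above, not MediaWiki's actual import code: the function names `normalise_username` and `plan_import`, the timestamps and the name "ExampleUser" are all invented for illustration.

```python
def normalise_username(name: str) -> str:
    """Apply the import-time changes described in principle 3."""
    name = name.replace("_", " ")        # underlines become spaces
    return name[:1].upper() + name[1:]   # initial letter becomes upper-case


def plan_import(existing, incoming):
    """Return the incoming (timestamp, username) pairs that would be added.

    `existing` holds the pairs already stored at the target wiki, with
    usernames exactly as they sit in its database.
    """
    existing = set(existing)
    imported = []
    for timestamp, user in incoming:
        candidate = (timestamp, normalise_username(user))
        if candidate in existing:
            continue  # principle 1: same timestamp and username, so skipped
        imported.append(candidate)
    return imported


# Hypothetical histories; only the two 2001-style usernames come from the text.
en_wp = [
    ("2001-11-05T12:00:00Z", "Larry_Sanger"),
    ("2001-11-06T09:30:00Z", "ExampleUser"),
]
nost_wp = en_wp + [("2001-10-01T08:00:00Z", "office.bomis.com")]

planned = plan_import(en_wp, nost_wp)
for timestamp, user in planned:
    print(timestamp, user)
```

In this model the "ExampleUser" edit is skipped, while the "Larry_Sanger" edit is imported again as "Larry Sanger" because the normalised name no longer matches the stored one. That is the duplicate described in principle 3.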

Overlapping edits and mismatched titles (or why I make strange page moves)

Overlapping edits can occur in the following three circumstances:

Note: In the second and third instances, if the duplicate edits are merged, it is impossible for an admin to separate them because they have the same timestamp.

To deal with these overlapping edits, when I only had the transwiki import right, I would normally follow this procedure, where "pagename" was the name of the page (importing by upload allows me to change the name of the page manually):

  • Is it worth it? If there is only one non-overlapping revision out of twenty, and it doesn't contain any useful information, then there's no need to import it, since doing so would create many duplicate edits. Clutter in the Wikipedia database should be minimised where possible.
  • Import the page using Special:Import. When asked for the namespace, select "MediaWiki talk", because it is not well-used and the chance of a title collision is almost zero. It is a good idea to check for an existing MediaWiki talk page at the target title prior to importation, though, since MediaWiki messages can have very strange titles and comments may be left randomly by users unfamiliar with the main MediaWiki namespace discussion points. Also, there may be remnants of the time when the MediaWiki namespace functioned like the template namespace; these have mostly been deleted already, but be careful of these edits when doing history merges.
  • Move the page in the English Wikipedia to the MediaWiki talk page containing the history imported from nost.wp. Ask MediaWiki to delete that page while performing the move.
  • Undelete the earliest edits from the MediaWiki talk page, taking care not to undelete duplicate edits.
  • Move the page back to the original title, without the redirect, since a redirect from the MediaWiki talk to the main namespace is useless.
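
The steps above can be sketched as a small simulation. This is a toy model only: `ToyWiki`, `Rev` and `merge_old_history` are hypothetical names, the revisions are invented, and real page moves, deletions and undeletions are performed through the MediaWiki interface rather than through code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rev:
    timestamp: str  # ISO 8601, UTC
    user: str
    source: str     # "en.wp" or "nost.wp"; bookkeeping for this sketch only


class ToyWiki:
    def __init__(self):
        self.live = {}     # title -> visible revisions
        self.deleted = {}  # title -> deleted revisions

    def import_page(self, title, revs):
        self.live.setdefault(title, []).extend(revs)

    def move(self, src, dst, delete_target=False):
        """Move src to dst without leaving a redirect behind."""
        if dst in self.live:
            if not delete_target:
                raise ValueError(f"{dst} already exists")
            self.deleted.setdefault(dst, []).extend(self.live.pop(dst))
        self.live[dst] = self.live.pop(src)

    def undelete(self, title, keep):
        archived = self.deleted.get(title, [])
        self.live.setdefault(title, []).extend(r for r in archived if keep(r))
        self.deleted[title] = [r for r in archived if not keep(r)]


def merge_old_history(wiki, pagename, nost_revs):
    scratch = f"MediaWiki talk:{pagename}"
    wiki.import_page(scratch, nost_revs)               # import into scratch space
    wiki.move(pagename, scratch, delete_target=True)   # move over it, deleting it
    present = {(r.timestamp, r.user) for r in wiki.live[scratch]}
    # Undelete only revisions that do not duplicate an existing edit.
    wiki.undelete(scratch, keep=lambda r: (r.timestamp, r.user) not in present)
    wiki.move(scratch, pagename)                       # move back, no redirect
    return sorted(wiki.live[pagename], key=lambda r: r.timestamp)


wiki = ToyWiki()
wiki.live["Transport"] = [
    Rev("2001-12-10T00:00:00Z", "UserB", "en.wp"),
    Rev("2002-03-01T00:00:00Z", "UserC", "en.wp"),
]
nost = [
    Rev("2001-11-01T00:00:00Z", "UserA", "nost.wp"),
    Rev("2001-12-10T00:00:00Z", "UserB", "nost.wp"),  # overlaps with en.wp
]
merged = merge_old_history(wiki, "Transport", nost)
```

After the merge, the page history holds the older "UserA" edit ahead of the two existing edits, and the overlapping "UserB" copy from nost.wp stays deleted at the scratch title.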

I use a similar procedure when the page titles on en.wp and nost.wp differ. If there are no overlapping edits in the two page histories, it's often easier to move the English Wikipedia page to the nost.wp title before importing the edits. When importing talk pages, I import them to the main namespace, since very few encyclopedic articles in the English Wikipedia end with the title "/Talk".
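
With an upload import, a title mismatch can instead be handled by editing the XML file before uploading it, as noted in principle 2. The following is a minimal sketch of that edit using Python's standard library. The function name `rename_page`, the sample page and the schema namespace URI in the sample are assumptions for illustration, and a real dump may need other elements adjusted as well.

```python
import xml.etree.ElementTree as ET


def rename_page(xml_text: str, old_title: str, new_title: str) -> str:
    """Return the export XML with every matching <title> changed."""
    root = ET.fromstring(xml_text)
    # With a default namespace, the root tag looks like "{uri}mediawiki".
    if root.tag.startswith("{"):
        ns_uri = root.tag[1:].split("}", 1)[0]
        ET.register_namespace("", ns_uri)  # avoid "ns0:" prefixes on output
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "title" and element.text == old_title:
            element.text = new_title
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-page export; the namespace URI is assumed, not taken
# from any particular dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Transport/Talk</title>
    <revision>
      <timestamp>2001-12-01T00:00:00Z</timestamp>
      <text>Example text</text>
    </revision>
  </page>
</mediawiki>"""

renamed = rename_page(SAMPLE, "Transport/Talk", "Talk:Transport")
print(renamed)
```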

How I find edits to import

A major source of edits to import is the automatically generated list of pages with the most revisions on the Nostalgia Wikipedia. As of 16 December 2011, I have analysed all of the 5,000 pages on that list for edits worth importing. The list covers 38,765 of the 93,105 old edits in the Nostalgia Wikipedia database (including those by Conversion script), which is only 41.6% of the total number of edits. Another way to find edits that should be imported is to check the contributions of editors who have edited between 20 December 2001, the last old edit in the Nostalgia Wikipedia, and 25 January 2002, when UseModWiki was replaced with the Phase II software. The latter method can yield results because, under the KeptPages system in use when Wikipedia ran UseModWiki, making an edit would cause older edits to be deleted to make room for the newer one.
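
The 41.6% figure can be checked from the two edit counts given above. The variable names here are just labels for this quick calculation.

```python
edits_on_list = 38_765   # edits on the 5,000 most-revised pages
all_old_edits = 93_105   # all old edits in the Nostalgia Wikipedia database

share = edits_on_list / all_old_edits
print(f"{share:.1%}")
```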

Other database dumps

I have downloaded the English Wikipedia dumps from 2002 and 2003 that are available from the above-linked site and installed them on a local copy of MediaWiki 1.21.1, the latest release version of the software available when I started working on this project on 9 June 2013. I used them to retrieve old edits that had gone missing from Wikipedia. I wrote some tips about upgrading from very old database schemas at the relevant section of the MediaWiki manual about upgrading.

Trivia