Wikipedia talk:Database reports/Archive 4

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia

usernames vs articles

Would there be any way to locate instances where both (X Y) and (user:X Y) exist, and (X Y) is a living person? DS (talk) 14:21, 7 June 2010 (UTC)

I did fuzzy matching between article creators and article titles at some point. The data is available here. It'd be neat if something similar could be done with usernames and BLP titles, though you're dealing with much larger sets, so I'm not sure how technically feasible it is.
In general, there are two important considerations to make with a task like this: false positives and false negatives. The false positives come from common names (most are), causing coincidental collisions that aren't of any importance. The false negatives come from two sources that I can think of off-hand: the addition of middle names or inclusion of full names in page titles and page titles that disambiguate (e.g. "John Smith (minister)"). The middle name issue is offset by fuzzy matching; the disambiguation issue is offset by stripping. --MZMcBride (talk) 18:25, 7 June 2010 (UTC)
If those users did not edit articles about themselves, then identifying them can raise privacy issues. Sole Soul (talk) 22:41, 7 June 2010 (UTC)
How so? --MZMcBride (talk) 00:33, 8 June 2010 (UTC)

Link to the userrights screen would be keen. Filling in the reason as 'autoconfirmed' even moreso. –xenotalk 01:28, 14 June 2010 (UTC)

I've changed the report to use {{dbr link}}, allowing any user to customize the output. Other reports will hopefully eventually use this template as well. --MZMcBride (talk) 15:28, 16 June 2010 (UTC)
Awesomeness. –xenotalk 15:31, 16 June 2010 (UTC)

Suggestion: sort by "time added"

This is not a request, it is a suggestion concerning all applicable reports. The suggestion is to add a column which marks when an item was first added to the report.

The rationale: If an item has been reported in the last 10 updates of a daily report, chances are that more than one user has checked the item.

If this suggestion was accepted, there is no need to apply it retroactively for the current items. Sole Soul (talk) 03:19, 7 July 2010 (UTC)

I can't see any simple or clean way of doing this right now. I'll have to give this more thought. In general, I'd say that better reports are preferred to just adding a timestamp column. That is, if certain results keep showing up and are getting in the way, it'd be good to find a way to make that not happen rather than simply point out to the user how long these items have been getting in the way. Is there a particular report or set of reports where you think this feature would be useful? --MZMcBride (talk) 04:45, 13 July 2010 (UTC)
Thank you. My impression is that this is a bot function not a database query function, am I right? Sole Soul (talk) 05:51, 13 July 2010 (UTC)
I'm not sure I understand the question. The MediaWiki database doesn't track when items are added to a page. It would have to be stored in a separate table somewhere, I imagine. --MZMcBride (talk) 05:52, 13 July 2010 (UTC)
When I made my suggestion I presumed that it would be done by programming BernsteinBot to compare the last update of a report with the new update, and I just asked to know if my understanding was correct or not. Sole Soul (talk) 06:05, 13 July 2010 (UTC)
Yes, it would do something like that. BernsteinBot would probably maintain a database in which it stored database report title, a key of some sort (maybe a cryptographic hash of the table row]]), and a timestamp. Then the bot would compare any updates to the database table and update the database table or wiki table as necessary. Or something like that. It's not impossible to code, but it's not a trivial task and I'm not sure I see the benefit (yet). --MZMcBride (talk) 06:08, 13 July 2010 (UTC)
Users go through a large report either from the top down or the bottom up or randomly. With time, this will make the bulk of the top and bottom items stabilize (i.e. they will be repeated because many users have checked them). This means that small percent of items are checked many times while the rest of items checked a few times or never. Sole Soul (talk) 06:37, 13 July 2010 (UTC)

More long pages

Is it possible to get more long talk pages for Wikipedia:Database reports/Long pages in section "Specified talk pages"? It would be preciated, since the list is good to check for pages that needs to be archived and reduced in size. What about change the size limit from 175,000 bytes to 150,000 bytes? --Kslotte (talk) 16:06, 12 July 2010 (UTC)

Sure, done. --MZMcBride (talk) 04:44, 13 July 2010 (UTC)
Thanks, 300+ entries found. --Kslotte (talk) 10:58, 14 July 2010 (UTC)
Could you also exclude pages that transclude {{ Historical}}?. Tim1357 talk 10:26, 16 July 2010 (UTC)
I don't see any point why to exclude that. If it is about getting some Wikipedia talk "permanent archives" excluded, then excluding historical template alone isn't enough. There exist also many other templates. --Kslotte (talk) 10:52, 16 July 2010 (UTC)
Sounds a bit like "perfect is the enemy of done." If we can say definitively that pages marked with {{historical}} should be excluded, then there's no reason to list them, even if we may miss other pages that aren't properly marked or are marked using a different template. --MZMcBride (talk) 01:31, 18 July 2010 (UTC)

Perhaps exclude all noindexed pages. Most of the permenant archive templates dump the pages into the noindex category. Tim1357 talk 01:01, 18 July 2010 (UTC)

I don't know about this. A lot of pages get {{NOINDEX}}'d for a lot of reasons. --MZMcBride (talk) 01:31, 18 July 2010 (UTC)

The list have have been reduced. It is time to take next step and lower the size limit to 140K for Wikipedia:Database reports/Long pages in section "Specified talk pages". --Kslotte (talk) 09:42, 28 August 2010 (UTC)

And, could the {{pagelinks}} be used instead of {{pler}}? What is the idea with the raw format? do we need it? --Kslotte (talk) 22:55, 29 August 2010 (UTC)
Changed the limit per your request. I also set the report to use {{dbr link}}, so you can customize the link output to your heart's desire. --MZMcBride (talk) 01:12, 30 August 2010 (UTC)
I did this change, making use of {{pagelinks}} and an additional link to the raw version. --Kslotte (talk) 17:21, 30 August 2010 (UTC)
I did revert it. Seems like there is too many transculations on one page. Links at end of page won't work. Possible some wiki function restrictions. --Kslotte (talk) 18:31, 20 September 2010 (UTC)

The list have have been further reduced. It is time to reduce size limit to 130K for Wikipedia:Database reports/Long pages in section "Specified talk pages". --Kslotte (talk) 21:17, 21 October 2010 (UTC)

Atypical deletion log actions (configuration) has been disabled. When RevDelete was only enabled for oversighters, this report made a lot more sense. Now it has simply become unwieldy and ineffective. bugzilla:17293 would address this issue in MediaWiki itself, so write your Congressmen about getting that fixed. --MZMcBride (talk) 04:59, 13 July 2010 (UTC)

Can we get a petition going? Killiondude (talk) 19:42, 18 July 2010 (UTC)
With God, all things are possible. --MZMcBride (talk) 20:26, 18 July 2010 (UTC)

Most-linked redirects

Would it be possible to have a database of redirects that have the most links to them in articles? For example, finding out how many articles link to Dusty Rhodes (baseball player) when those should all link to Dusty Rhodes (baseball). Running it weekly would work fine. Wizardman Operation Big Bear 16:59, 17 July 2010 (UTC)

The pagelinks table is one of the largest. As a basis for comparison, all non-deleted revisions equal roughly 328,876,236 rows. The pagelinks table is currently roughly 507,145,981 rows. So any queries on it generally don't work so well. I'm testing the query to get this data now, but it's been running for a few hours and I'm not sure if it will finish before it gets killed. We'll see. --MZMcBride (talk) 20:58, 17 July 2010 (UTC)
Diddo MZMcBride. Finding things like most linked require a function that takes friggin forever to complete. This is because it needs to find how many links exist for every single redirect before it can spit out a list of the most linked. I suggest a list of redirects with over a certian ammount of links (say 500 links), limited to the first 100 found. Tim1357 talk 01:00, 18 July 2010 (UTC)
You mean ditto, not diddo. --MZMcBride (talk) 17:11, 18 July 2010 (UTC)
Tim1357 talk 18:46, 18 July 2010 (UTC)
So, I ran this query, but it took about 20 hours. And the result set wasn't ordered correctly, so I have to run it again. I'm not sure I see the need for this to be a proper database report. I may just throw it in a sandbox. Dunno. --MZMcBride (talk) 17:11, 18 July 2010 (UTC)
20 hours seems like a bit much for such a low priority type of report. Maybe just have this be a "as requested" report. Tim1357 talk 18:46, 18 July 2010 (UTC)
I tried it without any filters and after 3 hours I got a 60 MB file with 2,566,221 rows. However, I'm unsure how to filter the set. Plus we don't fix redirects. — Dispenser 23:21, 21 July 2010 (UTC)
3 hours? Can you head or tail the file and post results somewhere? I'm also not sure what "filter the set" means. Maybe paste the query you used? :P --MZMcBride (talk) 01:12, 22 July 2010 (UTC)
I looked at the query you used. I think you missed the "in articles" part. Or maybe I mis-read the request. It seems like he only wants to count links from other articles, though. --MZMcBride (talk) 11:10, 22 July 2010 (UTC)
It takes 5.2 hours searching only the article namespace, but only 15 minutes when both the page and redirect ends with ')'. However, the requester hasn't come back to clarify what he's going to do with it. — Dispenser 03:10, 29 July 2010 (UTC)

Images without categories

Would it be possible to have a request run to find images with no categories whatsoever, including categories from templates? In theory, every image should be categorised based on its licensing; if it's not, it'll fail NFCC#10b, as all the copyright tags include licensing categories (or should). This should help us find problem images (or perhaps even problem templates) that the bots and other checks are missing. I don't know the scale of this problem, but I know that it exists. I guess having it checked weekly would be good. J Milburn (talk) 02:42, 27 July 2010 (UTC)

In case it wasn't clear, I was thinking about images hosted on enwp, not images hosted on Commons that happen to have some details on pages here. Can't see any harm in including the latter, but just thought I'd note that. J Milburn (talk) 02:45, 27 July 2010 (UTC)
Special:UncategorizedFiles? --MZMcBride (talk) 01:56, 28 July 2010 (UTC)
Well, I'll be damned. Thank you muchly. J Milburn (talk) 12:04, 28 July 2010 (UTC)
Interestingly, it didn't seem to catch the images I was after, which is odd. Very useful though, thanks again. J Milburn (talk) 12:14, 29 July 2010 (UTC)

Redirects

I created Wikipedia:Pages with most redirects. Don't know if it belongs on this page. Also working on Wikipedia:Templates with names differing only in capitalization. Rich Farmbrough, 13:20, 28 July 2010 (UTC).

If you plan on updating them at somewhat frequent intervals, please add them to the subject-space page. --MZMcBride (talk) 16:54, 28 July 2010 (UTC)

Interwiki reports

Would it be worth producing a report on problematic interwiki links? I have recently come across the following errors

  • Asymmetric links. i.e Article A points to article B but the en interwiki on the latter doesn't point back to article A.
  • Interwiki links to deleted articles. (These don't appear red like internal links do).
  • Interwiki links to redirects.

Now the third one may not really be a problem if we obey the rule that we don't fix redirects but the other two are definitely worth fixing. Boissière (talk) 20:47, 28 July 2010 (UTC)

There is lots of bots that fix the first error. The second one might be worth looking into, if no bot does this already. Also, I think that bot that immediately fixes such errors would be better than just a report. Svick (talk) 21:23, 28 July 2010 (UTC)
Bot requests are a bit different from database reports. Though obviously there is some overlap, as some reports are fed into bots and scripts to automatically or semi-automatically fix issues in pages.
Off-hand, I'm inclined to say that listing interwikis to redirects isn't going to be helpful and would create the illusion that they should be fixed. I don't see any reason for the redirects to bypassed unless the target page isn't accurate (the same title being re-purposed for a disambiguation page or whatever), obviously. I don't think there's really a way for a bot to know that, though.
I agree with Svick regarding points 1 and 2. Bots either exist (such as pywikipedia's interwiki.py, Google it for more info) or could exist to do some of this work. I imagine most of these bots would generate their own lists, though I'd be happy to help if I can. --MZMcBride (talk) 23:31, 28 July 2010 (UTC)
OK, I would agree that number 3 is not necessary and number 2 is fairly easy for bots to handle (though it would be interesting to know how good a job they are doing). What I can't see in the case of number 1 is how a bot could work out which link was faulty. I would have thought it would require some examination of the articles in question to decide where the links should actually go. How easy would it be to knock up a query for number 1? If it is not too hard then we could see what the current situation is. Boissière (talk) 20:07, 29 July 2010 (UTC)
I have a tool tools:~merl/reverselanglinks which is used by a gadget on dewiki to search interwikis linking to a single article. Because of these experiences i think the output of this query would be very long.
If you create this report you also have to keep in mind that jbowiki is not first-letter-upper-case like all other wikipedia. Merlissimo 20:46, 29 July 2010 (UTC)
Your comment about the output being long does imply that it is something that a bot cannot easily fix. It has occurred to me that a database report might not be the best approach as presumably you have to attach to each language wiki's own database. I am now wondering if using the API would be easier and therefore a reporting (but not fixing) bot would be better. Boissière (talk) 15:31, 2 August 2010 (UTC)

Is there any reason why that report hasn't been updated for nearly three months? Does it have anything to do with server load? I for one find that report interesting. I wonder if pending changes has had much of an effect on the data. Graham87 09:16, 29 July 2010 (UTC)

It's supposed to run on the 28th day of each month. It keeps getting killed because it runs for so long, I think. I'm not really sure if there's anything to be done about this right now. --MZMcBride (talk) 02:44, 31 July 2010 (UTC)
"data as of 18:11, 28 August 2010 (UTC)" — wheeeee! --MZMcBride (talk) 02:26, 30 August 2010 (UTC)

Archived page

I've (selectively) archived this talk page. I was going to write a script to do it, but it's not worth the development time when it only takes a few minutes to do it manually.

Some of the threads I removed were either regarding reports I didn't really feel like generating or the original poster failed to respond in a while.

If you actually want a report, re-post here and someone will take care of it, eventually. You may have to nag, though. --MZMcBride (talk) 02:41, 31 July 2010 (UTC)

births and births

Been finding a lot of biographies of dead people mistakenly put in two births categories (par exemple), could we get a report of people in two separate births categories? --Closedmouth (talk) 13:30, 14 August 2010 (UTC)

Make me! --MZMcBride (talk) 18:20, 14 August 2010 (UTC)
Here you go: http://en.wikipedia.org/w/index.php?oldid=378928452
If there isn't one already, you should create a (maintenance or talk page) category for biographies that cover more than one individual. It would allow for much easier filtering in the future, for this list and other lists. Once that's in place, I'll write up a report that regularly updates. I think this is a reasonable compromise. --MZMcBride (talk) 20:27, 14 August 2010 (UTC)
If you change the WHERE from WHERE c1.cl_to != c2.cl_to to WHERE c1.cl_to < c2.cl_to it would stop the double rows. -- WOSlinker (talk) 20:48, 14 August 2010 (UTC)
Aye, nice. :-) I think SELECT DISTINCT would also do it. I was just too lazy to re-run the query. --MZMcBride (talk) 20:49, 14 August 2010 (UTC)

File description pages containing no license

Been mulling this over for a while; similar to Wikipedia:Database reports/File description pages containing no templates, I'm looking for "File description pages containing no license" - but the actual finding of them...? Theoretically each image description page would include more than one template - i.e. Logo & FUR - which would probably lead to tons of false positives. Another option would be to scan the image page to see if any of the templates from Category:Wikipedia image copyright templates is present, but it's a huge category. Another possibility - checking for "very short" page length instead of "no" template at all... Suggestions would be greatly appreciated :) Skier Dude (talk 07:21, 19 August 2010 (UTC)

tools:~erwin85/shortpages.php seems to be what you're looking for. It allows you to find pages in the file namespace that don't transclude any templates. Killiondude (talk) 16:26, 19 August 2010 (UTC)

::Thanks!! Not only does it catch my issue - also catches the description pages for commons images as well! Skier Dude (talk 04:07, 21 August 2010 (UTC)

Changed my mind - it's picking up only the short pages - that means that if there's one template (say, the basic info) it isn't included by the tool. So, my request for "very short pages" isn't the right answer for this :( Skier Dude (talk 04:17, 21 August 2010 (UTC)
You've seen Non-free_files_missing_a_rationale (configuration), right?
If that report isn't really what you want, you should find some specific example files that are problematic. I can take a look at them and see if some patterns are evident. --MZMcBride (talk) 04:31, 21 August 2010 (UTC)
The tagged ones are in Category:Wikipedia files with unknown copyright status - I've added a goodly number there ;)Skier Dude (talk 03:29, 26 August 2010 (UTC)

Unlisted stub types

I think a useful report for WP:WikiProject Stub sorting would be all stub categories (any category ending with the word "stubs") or stub template (ending with the string "-stub") not linked to from Wikipedia:WikiProject Stub sorting/Stub types. A good heading would be:

Stub types not listed at Wikipedia:WikiProject Stub sorting/Stub types. These may be unproposed (in which case they should be listed at Wikipedia:WikiProject Stub sorting/Discoveries, or they may have simply not been listed by a user who created a proposed stub type, or renamed one under WP:Stub types for deletion.

עוד מישהו Od Mishehu 08:45, 19 August 2010 (UTC)

Shouldn't be too difficult to write. How often should it update? And I assume Wikipedia:Database reports/Unlisted stub types will work as an output location and title? --MZMcBride (talk) 03:59, 24 August 2010 (UTC)
This was actually one of the many reasons for creating {{Asbox}} - in that the List Comparer from WP:AWB can be used to compare Category:Stub message boxes with links from Wikipedia:WikiProject Stub sorting/Stub types. Unfortunately there are only about 3400 on the page and there are about 16000 in the category. It is necessary to pick up some subpages too namely
  • Wikipedia:WikiProject_Stub_sorting/Stub_types/Geography
  • Wikipedia:WikiProject_Stub_sorting/Stub_types/Culture
  • Wikipedia:WikiProject_Architecture/Stub_categories
But even then the list is far short of the category size.
Xeno has also created Wikipedia:WikiProject Stub sorting/Complete template list, last updated March.
I will paste my results into Wikipedia:Database reports/Unlisted stub types and Wikipedia:Database reports/Uncategorized stub types, presently. Rich Farmbrough, 14:27, 31 August 2010 (UTC).
Most of these stub types wil be listed at Wikipedia:WikiProject Stub sorting/Templates to vet. Rich Farmbrough, 15:11, 31 August 2010 (UTC).
Haveing remove those there's still abuut 3000 left. oI am adding them as an appenidx to Wikipedia:WikiProject Stub sorting/Templates to vet. Rich Farmbrough, 19:32, 31 August 2010 (UTC).

I see the 2 most recent lists in WSS are from March and from August (5 months apart), I think we need a more relyable system. About MZMcBride's list at Wikipedia:Database reports/Unlisted stub types - the title is good; one thing I forgot to consider is that the data is in the transcludeed subpages, which means that it takes longer for the data to get to Wikipedia:WikiProject Stub sorting/Stub types - I'd say that we should see what's not linked to from the subpages of Wikipedia:WikiProject Stub sorting/Stub types. עוד מישהו Od Mishehu 08:06, 2 September 2010 (UTC)

Replacement for Wolterbot

From WP:VPT

Is it difficult for someone to replace the late Wolterbot with a bot that goes through the FAs and coutns how many cleanup categories are also listed at the bottom to update Wikipedia:Featured articles/Cleanup listing. I cannot imagine it to be a difficult task YellowMonkey (new photo poll) 00:54, 23 August 2010 (UTC)

Maybe request a listing at WP:DBR? fetch·comms 02:36, 23 August 2010 (UTC)
Try asking at WP:BOTREQ Peachey88 (T · C) 07:22, 23 August 2010 (UTC)

The old bot was monthly, which is enough, really YellowMonkey (new photo poll) 04:12, 24 August 2010 (UTC)

I'm gonna let the WP:BOTREQ have first stab at this. If no one bites, I'll take a look. At a quick glance, it doesn't look like a bot was updating Wikipedia:Featured articles/Cleanup listing. And I'm not really sure what you define as a "cleanup category." Is there a master category somewhere of cleanup categories?
I hadn't realized B. Wolterding had stopped editing. :-/ --MZMcBride (talk) 04:18, 24 August 2010 (UTC)
Well, the it was updated by a bot until March, and when the problems were dealt with, they were struck off manually between the month. I guess if there is a parent cat, the usual ones needed are unsourced, dead links, tagged to POV/weasel, undue weight and all that YellowMonkey (new photo poll) 04:42, 24 August 2010 (UTC)
I threw this together in about 5 minutes, what do you think? Tim1357 talk 20:35, 28 August 2010 (UTC)

Report request- transclusions of deleted templates

I would like to request that a DBR be created: Pages transcluding deleted or nonexistant templates, run Sunday and Wednesday. Would be nice to catch stuff that slips through TFD/H. Note: all pages, regardless of namespace, should be listed. —Train2104 (talk · contribs · count · email) 00:25, 26 August 2010 (UTC)

I just quickly peeked at this yesterday but I was having an issue mapping this out from my mind to the database, so I left it for another day. There are a few points I wanted to make while I'm here, though:
  1. Generally I favor "red-linked" instead of "deleted or nonexistant" [sic!]; this is mostly because a template can be red-linked due to never having been created or due to having been deleted (and this can be an important distinction in the context of some reports); the generic term covers both cases fairly well, in my view;
  2. There are some oddities in some (complex or "esoteric") templates that cause them to transclude non-existent templates; this is mostly due to poor coding from what I've seen, but it doesn't change the fact that there are glaring red links in reports like Templates transcluded on the most pages (configuration) currently (e.g., Template:Pubchemcite is listed as having over 4,000 transclusions but has never existed, as far as I can see);
  3. When you talk about "templates", the reality is that a template can exist in nearly any namespace (with a few exceptions); {{User:AbusiveAdmin/Racist userbox}} is technically a template just like {{!}}; I'm going to assume you're only talking about transclusions of pages of in the Template namespace; if that's not the case, please let me know;
  4. [Related to my second point] I'm not sure you realize the volume of such a report; it will likely be thousands upon thousands of results, which leaves two options: (1) limit output to 1,000 results (truncate), or (2) paginate; I'm generally hesitant to paginate if it will be more than a few pages as I don't see that providing much value and it just adds needless clutter; Articles containing red-linked files (configuration) is a good example of this phenomenon.
I hope I haven't overwhelmed you; and I hope you've enjoyed the excessive semi-colons. --MZMcBride (talk) 02:58, 29 August 2010 (UTC)
I've found the use of Pubchemcite in {{Chembox PubChem}}. Probably should be a parameter but I haven't changed it yet. -- WOSlinker (talk) 10:16, 29 August 2010 (UTC)
If it is limited to 1000, then will it be sorted in a similar way to Templates transcluded on the most pages ? -- WOSlinker (talk) 10:22, 29 August 2010 (UTC)
It certainly could be. It'd just require pulling a COUNT(*) and sorting by it. --MZMcBride (talk) 18:00, 29 August 2010 (UTC)

I threw this together, but it seems to be pretty slow. It returned {{!comment}} as the first result. Tim1357 talk 22:24, 29 August 2010 (UTC)

The Query
SELECT tl_title
FROM   templatelinks 
       LEFT JOIN page 
         ON page_title = tl_title 
            AND page_namespace = 10 
WHERE  tl_namespace = 10 
       AND Isnull(page_id) 
LIMIT  1;

To cut down on the number of entries, lets put these restrictions on it:

  1. Transclusions of pages outside the template namespace wouldn't count.
  2. The template in question must have a deletion log entry. (reduces the esoteric issue, and plus, who would want to use a never-existent template?)

Train2104 (talkcontribscount) 23:31, 29 August 2010 (UTC)

I'm running this now. We'll see how many results it outputs (and in how much time). --MZMcBride (talk) 01:21, 30 August 2010 (UTC)
With or without Train2104's restrictions? Tim1357 talk 02:04, 30 August 2010 (UTC)

(unindent) With. The query I used was:

SELECT
  tl_title,
  COUNT(*)
FROM templatelinks
LEFT JOIN page
ON tl_namespace = page_namespace
AND tl_title = page_title
JOIN logging_ts_alternative
ON tl_namespace = log_namespace
AND tl_title = log_title
AND log_type = 'delete'
WHERE page_id IS NULL
AND tl_namespace = 10
GROUP BY tl_title
ORDER BY COUNT(*) DESC;

It took fifteen minutes to run and output 6,164 rows. A sample of the top:

Extended content
mzmcbride@bert:queries$ head -50 deleted-templates.txt
tl_title        COUNT(*)
WPBannerMeta/collapsed  381250
Spoiler 7470
Pending_deletion        4788
WPStatistics/class      1338
WPMeasure/class 930
Endspoiler      704
Highrfc-loop    618
Spoilers        505
IPAEng  390
User_Christian  333
Badbio  285
Emot    217
Grin    150
Chooserrwelcome 148
Oppose  144
CfD_doc 139
Tablabonita     138
SampleWikiProject       126
User_Chinese_reunification      126
Support 121
User_Anarcho-capitalist 121
User_admins_ignoring_policy     120
User_Christian_democrat 112
User_evol-2     112
User_en-6       109
Magazinecover   105
User_liberty    104
ISO_639_name_ar-Latn    104
Deenoewelcome   103
User_Two-state  92
Smiley  90
User_pro_concealed_carry        90
CfR_doc 89
Solarwelcome    88
User_Capitalist 84
Pokémon_species 84
User_Pro-choice 81
User_anti-gun   76
ISO_639_name_fil        76
Template_doc_page_transcluded   73
User_against_Saud       72
Kfwelcome       72
RfD_doc 71
Keep    70
Future_game     70
User_pope       70
Wpd     69
H:f     68
User_Sock_Puppet        68

And some results from the bottom of the file:

Extended content
mzmcbride@bert:queries$ tail -50 deleted-templates.txt
Beerbox_end     1
Cvg-software-stub       1
User_plays_runescape    1
List_of_people_Ba_Links 1
Placebox-location       1
Taxobox-2003_Invasion_of_Iraq   1
Actor   1
Country_alias_United_Kingdom    1
G0r     1
WikiProject_Banner_That_'70s_Show       1
Sortdate        1
Centrist_Party  1
F.C._Halifax_Town_squad 1
VfD-Disassociate        1
Infobox_City-NoFlag/Nickname    1
Micro-stub      1
User_NMDP       1
Gymnast_ru_image        1
List_of_people_Hu_Links 1
Political_party_w_logo  1
Alexwcovington/test/comment     1
Country_data_Brazil-imperial    1
G1k     1
Wikibookspage   1
Jadransport     1
NoDescribeEpisode       1
Spname  1
Chicago_portal  1
VfD-List_of_unusual_English_words       1
Usemlh. 1
Db-i9/new       1
User_smartass   1
HUNf1949        1
List_of_people_St_Links 1
Country_data_Kirovohrad_Oblast  1
User_googlechrome       1
G2i     1
Workers_Revolutionary_Party/meta/color  1
JoinWPSims      1
Note_N  1
SqlPac/WPDATABASE       1
1996_Leaders_of_the_Asia_Pacific_Economic_Co-operation_Leaders’_Forum   1
User_Chaplin    1
Cite_journal3   1
User_Warm       1
Fb_cl-non_grp_header    1
VfD-Social_surplus      1
Infobox_LEGO_Club       1
SCOTUS_1967-1969        1
User:Encyclopedist/Welcome!     1

This file is obviously sorted by number of transclusions. Train2104: Is this what you want? --MZMcBride (talk) 02:15, 30 August 2010 (UTC)

I took a look at {{Pending deletion}}, and what links here only gave me 1 transclusion (which I have removed). Where are the other 4787 transclusions? Also, the last entry is in the userspace.— Train2104 (talkcontribscount) 12:43, 30 August 2010 (UTC)
Some of the data is just bad. The templatelinks table is referencing non-existent page IDs. I imagine some old bug didn't delete the rows in the templatelinks table appropriately when a page was deleted via the user interface. That seems like a reasonable explanation for the high count for a template like Template:Pending deletion. It may be possible to account for these bad rows in future queries.
Most of the other data looks fine. All of the pages are in the Template namespace. For example: Special:WhatLinksHere/Template:User:Encyclopedist/Welcome! and Special:WhatLinksHere/Template:User_Sock_Puppet. --MZMcBride (talk) 14:33, 30 August 2010 (UTC)

Transclusions outside template namespace should be fairly few, excepting userboxen, and there's no reason they shouldn't be cleaned up (redirect to {{Tl:User box deleted}} maybe?). It would reduce the report size if you could ignore transclusions from TfD. Rich Farmbrough, 15:20, 31 August 2010 (UTC).

Looks fine to me. Paginate the first 4000 rows, truncate the rest until some are dealt with. — Train2104 (talkcontribscount) 02:26, 11 September 2010 (UTC)
The data should appear at Transclusions of deleted templates (configuration) in a few minutes. --MZMcBride (talk) 02:42, 11 September 2010 (UTC)

Medical articles frequently edited by non-autoconfirmed users

Hello. At Wikipedia_talk:Invitation_to_edit we're looking for the medical articles that are most frequently edited by non-autoconfirmed users (to trial a "how to edit a medical article" tutorial). Would it be possible to generate such a list? The top 40 should be sufficient. Anthony (talk) 17:57, 29 August 2010 (UTC)

Well, this is fairly complicated:
  1. What constitutes a "medical article"? Pages whose talk pages are tagged a particular way? Pages in a particular category? Pages from a particular list?
  2. Auto-confirmed status is implicit, not explicit, and it has changed over time. It used to be only four days; now it's four days + 10 edits. You could roughly approximate what the auto-confirmed standards of today are when looking at the database (at this specific moment in time), but you can't be exact. That is, you can say "if these people tried to edit today, we could say with a high degree of confidence that they wouldn't be able to on a semi-protected article due to their current edit count and registration date." But when you look at page histories, the stored edit count will remain current and the registration date will remain constant. So when you evaluate older edits, you won't know if the user was auto-confirmable then without manually calculating it for each edit. Other odd extensions like TorBlock completely change the auto-confirmed settings under certain conditions.
If you wanted to limit the report to something more easily definable (like all pages in Category:X sorted by number of IP edits), that would be much simpler. --MZMcBride (talk) 18:07, 29 August 2010 (UTC)

Thanks for the quick feedback! I've asked at the medicine project if there is a comprehensive list of medical articles. And IP edits is a good second best. I'll get back to you. Anthony (talk) 18:25, 29 August 2010 (UTC)

I'm back! Would it be possible to find the 100 articles with the greatest number of unique IP edits in the last year from the C, B, GA and FA class med articles here? Anthony (talk) 19:19, 29 August 2010 (UTC)

The data is at http://en.wikipedia.org/w/index.php?oldid=381806837. A few notes:

  1. it's unsorted;
  2. if there were 0 unique IP edits in the past year, the row was omitted (e.g., Timeline of tuberous sclerosis in Category:FL-Class medicine articles);
  3. the results are not truncated.

Hope that helps. The source of the script I used is below. --MZMcBride (talk) 02:18, 30 August 2010 (UTC)

Thank you so much. That was perfect. I have reduced the list to the top 20. --Anthony (talk) 09:32, 30 August 2010 (UTC)
Extended content
#! /usr/bin/env python

import MySQLdb

categories = ['FA-Class_medicine_articles', 'FL-Class_medicine_articles',
              'GA-Class_medicine_articles', 'B-Class_medicine_articles',
              'C-Class_medicine_articles']

conn = MySQLdb.connect(host='sql-s1', db='enwiki_p', read_default_file='~/.my.cnf')
cursor1 = conn.cursor()
cursor2 = conn.cursor()

for category in categories:
    cursor1.execute('''SELECT
                         page_title
                       FROM page
                       JOIN categorylinks
                       ON cl_from = page_id
                       WHERE page_namespace = 1
                       AND cl_to = %s;''' , category)
    print '\n== [[:Category:%s]] ==' % category
    print '{| class="wikitable sortable"\n! Page\n! Unique IP edits\n|-'
    for row in cursor1.fetchall():
        cursor2.execute('''SELECT
                             page_title,
                             COUNT(DISTINCT rev_user_text)
                           FROM page
                           JOIN revision
                           ON page_id = rev_page
                           WHERE page_namespace = 0
                           AND page_title = %s
                           AND rev_user = 0
                           AND rev_timestamp > DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 1 YEAR),'%%Y%%m%%d%%H%%i%%s')
                           GROUP BY rev_page;''' , row[0])
        for result in cursor2.fetchall():
             print '| [[' + result[0] + ']]\n' + '| ' + str(result[1]) + '\n|-'
    print '|}'

cursor1.close()
cursor2.close()
conn.close()

Non-free images with multiple files in the history?

Hey, would it be possible to get a report of all the files tagged as non-free which have multiple files in the history? For an example of what I mean, see File:Manchester Mark2.jpg- the image was reduced, but the old version is still visible (and will be until deleted). The old versions fail the NFCC (as they are not in use) and so some cleanup is required in every case. J Milburn (talk) 00:02, 19 September 2010 (UTC)

The first 1000 are at User:Svick/Non-free images with multiple files in the history. I can run that report on demand or regularly or extend it to more articles, if you wanted. Svick (talk) 15:25, 19 September 2010 (UTC)
Thank you very much, I'm requesting permission for my bot to deal with them, so a longer list may be very helpful- I'll see how the request for approval goes. J Milburn (talk) 22:03, 19 September 2010 (UTC)

I'd like to up the frequency of this report from monthly to weekly. Doable? --Stepheng3 (talk) 21:03, 20 September 2010 (UTC)

Sure, done. --MZMcBride (talk) 21:52, 20 September 2010 (UTC)
Thank you! --Stepheng3 (talk) 23:00, 20 September 2010 (UTC)
When can I expect another update? --Stepheng3 (talk) 16:13, 28 September 2010 (UTC)
Well, it says "0 12 * * 4" on the configuration page. That'd be 0 minutes past hour 12 on the fourth day of the week. So... maybe on Thursday? I'm not really sure why it's been over a week since it updated. I'll try to remember to peek at this later today. --MZMcBride (talk) 16:23, 28 September 2010 (UTC)
I poked at this a bit. I verified that the crontab entry is "0 12 * * 4 python ~/scripts/database-reports/coordoverlap.py > /dev/null" and that the script works properly. (I just ran it manually, so the report is now updated.) I think the report should update again this week. Please let me know if you continue to have issues. It's comparing two large sets ({{Coord/display/inline,title}} has 202,431 transclusions while {{Coord/display/title}} has 349,287 transclusions), but it shouldn't take more than a few minutes to execute and finish. I don't have any e-mail from cron indicating an issue from the script, so I guess we'll just have to wait and see if this continues to be a problem. --MZMcBride (talk) 02:39, 29 September 2010 (UTC)
Thanks. I'll let you know if it does. --Stepheng3 (talk) 03:13, 29 September 2010 (UTC)

Request for one-time run

Can we run this report, one time only? I don't need it listed on the table.

  • All files that start with "BSicon " or any case variant of it, and
  • that are not on Commons, or
  • that obscure another file with the same name on Commons?

It would help me find some bad icons, since all of them should be on commons. WP:RDT is a global project. — Train2104 (talk • contribs • count) 01:03, 6 October 2010 (UTC)

I'm not sure why you need a manual query for this. Special:PrefixIndex/File:BSicon and commons:Special:PrefixIndex/File:BSicon should do the trick (perhaps with a few set operators). --MZMcBride (talk) 01:14, 6 October 2010 (UTC)
Here's the list you want, I think: http://en.wikipedia.org/w/index.php?oldid=389000355 --MZMcBride (talk) 01:17, 6 October 2010 (UTC)
And the inverse: http://en.wikipedia.org/w/index.php?oldid=389000621 --MZMcBride (talk) 01:20, 6 October 2010 (UTC)

Uncategorized location map templates

Would it be possible to have a (one-off) report of all the uncategorised templates of the form "Template:Location map xxx"? Every time the uncategorised templates report is updated, a few are listed in the report, but I suspect there are others that aren't listed (noting that the uncategorized templates report is not a complete list), so I would like to have a complete list of all the uncategorised location map templates, so I can address them all in one go. If there's an easy way I can obtain this information myself, please advise. Thanks. DH85868993 (talk) 04:07, 9 August 2012 (UTC)

Report posted at User:DH85868993/Uncategorised location maps for you. - TB (talk) 14:45, 26 August 2012 (UTC)
Thanks! DH85868993 (talk) 01:14, 27 August 2012 (UTC)

WikiProjects ranked by liveliness

I would like a list of discussion pages of WikiProjects (so I'm talking about pages called "Wikipedia talk:WikiProject Whatever") ranked by the frequency with which they are edited---in effect the most active WikiProjects listed first. Can that be done? Michael Hardy (talk) 03:46, 24 October 2012 (UTC)

There are a couple of relevant reports already being produced. Do either of:
closely enough match your needs? - TB (talk) 07:01, 24 October 2012 (UTC)
I've spoken with Michael Hardy previously at the WikiProject Council page and recommended he come here. In trying to measure a project's activity or "liveliness", I can see the limits of the two reports you've listed. The current report on how many people are watching gives an impression that some projects are busy when in reality there has been little discussion in years (the dead WikiProject Contents tops the list). The other list provides the number of edits for articles under a project's scope, not necessarily edits made by the project's members. I think Michael Hardy is looking to see how many times the project's talk page is edited in a particular timeframe (maybe six months or a year) to gauge which projects are active places for discussion. –Mabeenot (talk) 15:44, 26 October 2012 (UTC)
Okay, initial report at User:Topbanana/WPtalk. It seems to be a good measure of activity for some projects, and an awful one for others. I suspect the parameters need tweaked to make it more useful before we make it a regular thing. - TB (talk) 17:25, 26 October 2012 (UTC)

These look as if they may be useful.

Thank you. Michael Hardy (talk) 18:29, 26 October 2012 (UTC)

Polluted categories report

I'm trying to improve Wikipedia by removing article categories from user pages per WP:USERNOCAT, and have found the Polluted categories report to be very helpful in identifying categories to fix. Although the Database reports list states that this report is updated weekly, it appears that it hasn't been updated since August. Could some kind developer please look to see how this report can be put back on schedule? Thanks! GoingBatty (talk) 02:17, 4 November 2012 (UTC)

All fixed up now. I rewrote the report, which fixed a few outstanding issues: the output limit is now 1000 instead of 250; the pipe trick has been killed; there are now some helper links for users to see which pages are polluting the category; other various fixes. Thanks for a clear and discrete problem to solve! ;-) --MZMcBride (talk) 01:19, 5 November 2012 (UTC)
Thank YOU for solving it, and so quickly too! GoingBatty (talk) 01:50, 5 November 2012 (UTC)

New report request: unsourced BLPs not BLPPRODed

A list of articles in Category:All unreferenced BLPs that don't transclude {{Prod blp/dated}}. -- Patchy1 08:39, 8 January 2013 (UTC)

I have run this report and placed the output at User talk:Patchy1/Unsourced BLPs not BLPPRODed for you. - TB (talk) 11:54, 8 January 2013 (UTC)
Excellent, thanks. -- Patchy1 12:41, 8 January 2013 (UTC)
Patchy1, you might also find http://toolserver.org/~magnus/catscan_rewrite.php usedful for that sort of query. --j⚛e deckertalk 18:34, 10 January 2013 (UTC)
Thanks! -- Patchy1 00:48, 11 January 2013 (UTC)