Wikipedia talk:Database reports

::I don't know SQL or anything about the Wikipedia database, but I have been reading up on stuff. So this is likely to be an annoying question that can be ignored, but I'll ask it anyway. Is <code>user_touched</code> a field in the database, and could that be used to check for recent activity? Thanks again for the help. - '''''[[User:Hydroxonium|<span style='color:black;background-color:yellow;'>Hydroxonium</span>]]''''' ([[User talk:Hydroxonium|<font color="black">H<sub>3</sub>O<sup>+</sup></font>]]) 08:58, 19 January 2011 (UTC)
:If it's not too much trouble, could we exclude users whose last creation was more than, say, 90 days ago? If it's too much faff, then never mind{{mdash}}it's easy enough to check manually. [[User:HJ Mitchell|<font color="Teal" face="Tahoma">'''HJ&nbsp;Mitchell'''</font>]] &#124; [[User talk:HJ Mitchell|<font color="Navy" face= "Times New Roman">Penny for your thoughts? </font>]] 01:33, 20 January 2011 (UTC)
:::HJ Mitchell, that won't slow it down much. We could just add <code> AND MAX(rev_timestamp)>DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 90 DAY),'%Y%m%d%H%i%s')</code> to the HAVING statement. [[User:Tim1357|<font color="Blue" face="Arial" >Tim]]</font><font color="Red" face="Optima" >[[Special:Contributions/Tim1357|1357]]</font> <sup><font face="Times new roman" size = 2 >[[User talk:Tim1357|talk]]</font></sup> 04:17, 20 January 2011 (UTC)
::::MZMcBride: I'll try to run this on one of the slow servers to see if I can get any results. [[User:Tim1357|<font color="Blue" face="Arial" >Tim]]</font><font color="Red" face="Optima" >[[Special:Contributions/Tim1357|1357]]</font> <sup><font face="Times new roman" size = 2 >[[User talk:Tim1357|talk]]</font></sup> 04:17, 20 January 2011 (UTC)
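For illustration, a hedged sketch of where that cutoff could sit; the SELECT and GROUP BY here are assumptions (the report's actual query isn't shown in this thread), only the HAVING condition comes from the suggestion above, and a full scan of the revision table like this would be far too slow to run as-is:

-- illustrative only; scanning revision grouped by user is impractical on enwiki
SELECT rev_user_text,
       MAX(rev_timestamp) AS last_edit
FROM revision
GROUP BY rev_user_text
-- rev_timestamp is a YYYYMMDDHHMMSS string, so the DATE_FORMAT of "now minus 90 days"
-- compares correctly as a plain string
HAVING MAX(rev_timestamp) > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 90 DAY), '%Y%m%d%H%i%s')
LIMIT 100;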


Requests: Please list any requests for reports below in a new section. Be as specific as possible, including how often you would like the report run.

Unprotected Templates by Number of Transclusions

Hello, It would be nice to see a list of templates that are not protected, sorted by the number of transclusions they have. That way some sysops could go through and protect the higher risk templates. Tim1357 (talk) 06:54, 17 January 2010 (UTC)[reply]

Would an extra column at Templates with the most transclusions (configuration) be sufficient or would you like a separate report? --MZMcBride (talk) 04:17, 11 March 2010 (UTC)[reply]
A column would probably be ok, but if you sorted by protection status would it sub-sort by number of transclusions? –xenotalk 14:51, 23 March 2010 (UTC)[reply]
Yeah, there may actually be an advantage to creating a separate report. Not the sorting issue, necessarily, but it would drastically cut down on the noise of the new report and allow one to be purely statistical (the current report) while the other could be more production (the new report). Though I'm not a huge fan of using template protection, so I'd want to avoid any mass protections simply due to being listed.... In the meantime, Indefinitely protected templates without many transclusions (configuration) desperately needs some love. --MZMcBride (talk) 15:06, 23 March 2010 (UTC)[reply]
Gah! Can you exclude subpages of Template:Convert? Some of these are also protected because they are "frequently substituted"... Maybe a filter for that might be worthwhile as well. –xenotalk 15:22, 23 March 2010 (UTC)[reply]
Yes, I can, but I'm not really inclined to. They're sorted alphabetically, just skip over them if you'd like. Or unprotect them, as they're not widely used and their protection is silly. --MZMcBride (talk) 05:24, 25 April 2010 (UTC)[reply]
Need to pick a report name for this and come up with a limit for number of transclusions. I guess it only matters if editing is sysop-only. Nobody should care about move protection, surely. --MZMcBride (talk) 05:24, 25 April 2010 (UTC)[reply]
Done: Unprotected templates with many transclusions (configuration). --MZMcBride (talk) 03:11, 11 September 2010 (UTC)[reply]

While I don't necessarily object to this report, it may be sensible to change the cut-off point given the mess it has inadvertently caused. Looking at the lower end of the report, do 501 transclusions really constitute "many"? PC78 (talk) 13:29, 13 September 2010 (UTC)[reply]

What limit would you suggest? --MZMcBride (talk) 19:17, 13 September 2010 (UTC)[reply]
1000? Even that may be too low, but it would be enough to cut the report (as it is now) in half. Ultimately it depends on what this report is to be used for. I'll raise the question over at the WP:ANI discussion. PC78 (talk) 00:10, 14 September 2010 (UTC)[reply]
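For reference, a rough sketch of the kind of query such a report could run, using PC78's suggested 1,000-transclusion cutoff; this is an illustration under assumed thresholds, not the query the report actually uses, and it only looks at sysop edit protection (per the note above that move protection shouldn't matter):

SELECT page_title,
       COUNT(*) AS transclusions
FROM page
JOIN templatelinks
  ON tl_namespace = 10
  AND tl_title = page_title
WHERE page_namespace = 10
-- skip templates whose editing is already restricted to sysops
AND NOT EXISTS (SELECT 1
                FROM page_restrictions
                WHERE pr_page = page_id
                AND pr_type = 'edit'
                AND pr_level = 'sysop')
GROUP BY page_title
HAVING COUNT(*) >= 1000
ORDER BY transclusions DESC
LIMIT 500;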

Potentially Commons

Would it be possible to have a report that lists Files in Category:All free media which are not currently in Category:Copy to Wikimedia Commons so that they may be more readily identified?

Someone called 'erwin' on the #Wikimedia-toolserver IRC, suggested the following via a pastebin,

   SELECT page_title
   FROM page
   JOIN categorylinks
   ON cl_from = page_id
   WHERE page_namespace = 6
   AND cl_to = 'All_free_media'
   AND NOT EXISTS (SELECT *
                   FROM templatelinks
                   WHERE tl_from = page_id
                   AND tl_namespace = 10
                   AND tl_title IN ('Nocommons', 'Not commons'))
   LIMIT 10

but they pointed out that it's only a basic starting point and would need a LOT of tweaking before it's usable for the intended purpose of finding Commons candidate material that is not currently tagged as such.

It would need to exclude Files already in Category:Copy to Wikimedia Commons, as well as images with PUI, No license or No source tags, etc.

Sfan00 IMG (talk) 11:31, 18 January 2010 (UTC)[reply]
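A rough sketch of how those additional exclusions might be expressed on top of erwin's starting point; the problem-tag template names below are placeholders, not the real tag titles, and would need to be checked:

SELECT page_title
FROM page
JOIN categorylinks
ON cl_from = page_id
WHERE page_namespace = 6
AND cl_to = 'All_free_media'
-- not already tagged for Commons
AND NOT EXISTS (SELECT 1
                FROM categorylinks AS c2
                WHERE c2.cl_from = page_id
                AND c2.cl_to = 'Copy_to_Wikimedia_Commons')
-- not carrying problem tags such as PUI / no license / no source
-- (template names below are placeholders)
AND NOT EXISTS (SELECT 1
                FROM templatelinks
                WHERE tl_from = page_id
                AND tl_namespace = 10
                AND tl_title IN ('PUI', 'No_license', 'No_source'))
LIMIT 100;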

Here are the first 100 results of pages in the File namespace that are not in Category:Copy to Wikimedia Commons or Category:Wikipedia files on Wikimedia Commons:
Extended content
mysql> SELECT
    ->   page_title
    -> FROM page
    -> JOIN categorylinks
    -> ON cl_from = page_id
    -> WHERE page_namespace = 6
    -> AND cl_to = 'All_free_media'
    -> AND NOT EXISTS (SELECT
    ->                   1
    ->                 FROM categorylinks AS c2
    ->                 WHERE c2.cl_from = page_id
    ->                 AND c2.cl_to = 'Copy_to_Wikimedia_Commons')
    -> AND NOT EXISTS (SELECT
    ->                   1
    ->                 FROM categorylinks AS c3
    ->                 WHERE c3.cl_from = page_id
    ->                 AND c3.cl_to = 'Wikipedia_files_on_Wikimedia_Commons')
    -> LIMIT 100;
+-----------------------------------------------------+
| page_title                                          |
+-----------------------------------------------------+
| SalamancaMarkets.JPG                                | 
| 1952_UMD_Navy_game.JPG                              | 
| Sean_Theriault.png                                  | 
| BiharDistricts.svg                                  | 
| Gulf_of_Suez_from_orbit_2007.JPG                    | 
| Irish_immigrants_1909.jpg                           | 
| Eiseleyat15.jpg                                     | 
| Stmc25logo.png                                      | 
| Panela.JPG                                          | 
| Assamfish.jpg                                       | 
| Putumayam.jpg                                       | 
| Chappati.jpg                                        | 
| LcMain2.jpg                                         | 
| The_world_flag_flying.jpg                           | 
| Lemar_image.jpg                                     | 
| MPLGelgel.gif                                       | 
| Rock_Ethos_Band.jpeg                                | 
| Stalin_Hitler_photomontage.png                      | 
| Rock_Ethos_Band_1.jpeg                              | 
| Bungee_Jumping_at_Rock_Ethos_2008.jpeg              | 
| Stage_at_Rock_Ethos_2008.jpeg                       | 
| Crowd_at_Rock_Ethos_2008.jpeg                       | 
| Leo_Griffin_1912.jpg                                | 
| CIS_logo.png                                        | 
| Rev_22_Scratch_Race.jpg                             | 
| Vectorsoliton_polarization_rotation_in_2009.jpg     | 
| Mark_Wrighton.jpg                                   | 
| Spectraliquid.jpg                                   | 
| Vectorsoliton_High-order_in_2009.jpg                | 
| Painscastle_rads.png                                | 
| Vectorsoliton_origin_in_2009.jpg                    | 
| Jefflocke.JPG                                       | 
| Aphrodite_fountain.jpg                              | 
| WADS.jpg                                            | 
| Dougflutie.JPG                                      | 
| HarthamWeirFloodStage.JPG                           | 
| W_Thomas_West.jpg                                   | 
| Inscription_Grey_Monument.jpg                       | 
| Hiram_Bell.jpg                                      | 
| Richgarces.JPG                                      | 
| Writeratworkbe.jpg                                  | 
| Newradnor_rads.png                                  | 
| Rinuccini.JPG                                       | 
| Middle_eastren_warrior.jpg                          | 
| Popcornstand.JPG                                    | 
| Paso-IceCream-AmigaVsPC.png                         | 
| Ergm_medium.gif                                     | 
| RO_licence_front.jpg                                | 
| Dwightevans.JPG                                     | 
| RO_licence_back.jpg                                 | 
| Erasers.jpg                                         | 
| F-105-354-333tfs-kc135-takhli.jpg                   | 
| F-105-4fw-mycoy-1962.jpg                            | 
| F-105-b66-takhli.jpg                                | 
| Fredlynn.JPG                                        | 
| F-105-george-nov73.jpg                              | 
| F-105D-62-4375-12thTFS-18thTFW-Kadena-18May1971.jpg | 
| Rhayader_rads.png                                   | 
| F-105G-63-8291-17wws.jpg                            | 
| F-105G-63-8319-561tfs.jpg                           | 
| Romanian_Identity_Card_2009.jpg                     | 
| F-105d-36thtfw-bit-1962.jpg                         | 
| F-105d-44tfs-1965.jpg                               | 
| F-105d-562tfs-thai.jpg                              | 
| F-105f-49tfw-spang.jpg                              | 
| Richardsonbill.JPG                                  | 
| F-105g-561tfs-mc.jpg                                | 
| F-105s-4thtfw-sj.jpg                                | 
| F-105s-george-1975.jpg                              | 
| F-105s-takhli-1964.jpg                              | 
| F102_F86D.JPG                                       | 
| Hodgenville_Lincoln_statue.jpg                      | 
| Blackwolddog.jpg                                    | 
| F000418.jpg                                         | 
| F0201002.gif                                        | 
| F0201005.gif                                        | 
| BPL-Main.jpg                                        | 
| F-89_Scorpion_Crash_IAE_Detroit_1952.jpg            | 
| F-111-27tfw-cannon.jpg                              | 
| F-111-f-4-kc135-takhli-1974.jpg                     | 
| F-111a-67-081-429tfs-takhli.jpg                     | 
| F-111f-mountainhome.jpg                             | 
| Sony_SRS-17.jpg                                     | 
| F_ertel_carlyle.jpg                                 | 
| F-117-acc.jpg                                       | 
| F-15D-79-008-525tfs-bit.jpg                         | 
| F-15D-80-082-52FW-Spang.jpg                         | 
| Knighton_rads.png                                   | 
| FAA_laser-free-zone.jpg                             | 
| F-5E-ClarkAB.jpg                                    | 
| Richardibest.jpg                                    | 
| F-82e-27fw-kearnyne.jpg                             | 
| F-82-b-29-kearneyne.jpg                             | 
| Smoothed_ripple.svg                                 | 
| TeDeumIllust.jpg                                    | 
| World_happiness.png                                 | 
| Matpandm.jpg                                        | 
| Zorah_on_the_Terrace.jpg                            | 
| Atelier_rouge_matisse_1.jpg                         | 
| Matissetoits.gif                                    | 
+-----------------------------------------------------+
100 rows in set (0.04 sec)
Please review this list and provide feedback regarding its results. Thanks. --MZMcBride (talk) 05:33, 25 April 2010 (UTC)[reply]
So far it's looking good :), I'm tagging the problems I find as well :) Sfan00 IMG (talk) 11:39, 25 April 2010 (UTC)[reply]
Is this still wanted? Is there a report name in mind? A report update frequency? --MZMcBride (talk) 19:25, 24 October 2010 (UTC)[reply]
Belated followup - This can't proceed until some other issues relating to finding images tagged for Commons that don't have recognisable license tags are resolved. Sfan00 IMG (talk) 13:49, 5 January 2011 (UTC)[reply]

Stale report?

Wikipedia:Database reports/Indefinitely semi-protected articles hasn't been updated since late October? –xenotalk 14:34, 1 March 2010 (UTC)[reply]

Yar, it got to be too big for a single page and paginating it is an issue as it divides the content into redirects and non-redirects. This isn't the only broken report, see below. --MZMcBride (talk) 19:33, 9 March 2010 (UTC)[reply]
What about splitting the reports? (Indefinitely semiprotected redirects). –xenotalk 19:35, 9 March 2010 (UTC)[reply]
That might be an option, though it has two issues: (1) it would make the semi-protected report inconsistent with the fully protected report (though I guess that report could use some pagination too, heh), and (2) it would mean that the non-redirect report would use something clunky like "Indefinitely semi-protected non-redirects" or "Indefinitely semi-protected articles", neither of which are very nice. --MZMcBride (talk) 19:43, 9 March 2010 (UTC)[reply]
Good god you're obsessive-compulsive like me, aren't you? ;p The two reports (Indefinitely semi-protected (articles|redirects)), when listed side-by-side, should be self-explanatory. Alternatively the landing page could disambiguate the viewer. –xenotalk 19:57, 9 March 2010 (UTC)[reply]

Uncategorized and Unreferenced BLPs

Could someone make a list of articles in Category:All unreferenced BLPs that are in no categories when one ignores all maintenance categories, Category:Living people and Cat:XXXX births. Rettetast (talk) 19:50, 22 February 2010 (UTC)[reply]

I made the above request at WP:BOTREQ a few weeks ago, but nothing happened. This is probably a better venue.
I have chosen to only do Unreferenced BLPs at the moment, but the report could be expanded to do all articles. It depends on how large the output will be. Also, articles in Category:Category needed are already in a queue and do not need to be listed on this report. Rettetast (talk) 11:30, 15 March 2010 (UTC)[reply]
Example page? --MZMcBride (talk) 16:43, 15 March 2010 (UTC)[reply]
1, 2, 3, 4, 5, 6, 7, 8. Rettetast (talk) 17:27, 15 March 2010 (UTC)[reply]
Thanks for the examples. I spent about an hour working on this, but I think I need to re-think how to do this. --MZMcBride (talk) 23:40, 15 March 2010 (UTC)[reply]
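For what it's worth, one possible shape for such a query, sketched under the assumption that "maintenance categories" can be approximated as hidden categories (via the hiddencat page property); this is not the query the eventual report uses:

SELECT p.page_title
FROM page AS p
JOIN categorylinks AS blp
  ON blp.cl_from = p.page_id
  AND blp.cl_to = 'All_unreferenced_BLPs'
WHERE p.page_namespace = 0
-- keep only articles with no "real" category left over:
AND NOT EXISTS (SELECT 1
                FROM categorylinks AS c
                WHERE c.cl_from = p.page_id
                AND c.cl_to NOT IN ('All_unreferenced_BLPs', 'Living_people')
                AND c.cl_to NOT LIKE '%\_births'
                -- crude maintenance-category filter: ignore hidden categories
                AND NOT EXISTS (SELECT 1
                                FROM page AS cat
                                JOIN page_props
                                  ON pp_page = cat.page_id
                                  AND pp_propname = 'hiddencat'
                                WHERE cat.page_namespace = 14
                                AND cat.page_title = c.cl_to))
LIMIT 200;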
Done: Uncategorized and unreferenced biographies of living people (configuration). How often do you want the report updated? --MZMcBride (talk) 21:29, 18 March 2010 (UTC)[reply]
Thank you. Once a week is enough. I'll leave a note here if I need the report to be expanded to all articles and not just the unreferenced BLPs. Rettetast (talk) 22:19, 18 March 2010 (UTC)[reply]

Could the report be expanded to all Living people? And please exclude articles that are tagged with {{catimprove}}. The latter should also be done to Wikipedia:Database reports/Potential biographies of living people (4). Rettetast (talk) 11:38, 7 April 2010 (UTC)[reply]

I'm not sure expanding this report would be accurate. If you use Category:Living people instead of Category:All unreferenced BLPs, the report title would be "Uncategorized biographies of living people," not "Uncategorized and unreferenced biographies of living people." Presumably the biographies not in Category:All unreferenced BLPs are referenced. Would a separate report work?
I'm also not sure why {{cat improve}} should be applied to Potential biographies of living people (4) (configuration). The goal of this report is to find pages that need Category:Living people. Can you elaborate? --MZMcBride (talk) 12:01, 7 April 2010 (UTC)[reply]
A separate report would be fine, but I was thinking of expanding the focus to Category:Living people, not moving it. Maybe a new report titled Biographies that need better categories (configuration), which would also include all articles in XXXX births/deaths, would be appropriate.
The point with {{cat improve}} is that a tagged article is already in a maintenance category and is more likely to already be fixed by the time you get to it. Rettetast (talk) 12:24, 7 April 2010 (UTC)[reply]
Bump. Rettetast (talk) 21:03, 22 May 2010 (UTC)[reply]
Are you using Potential biographies of living people (4) (configuration) to find pages in need of better categorization or are you using it to find bios that need to be in Category:Living people? It sounds like you're using the report for the former, when its purpose is the latter. I think that's why your {{cat improve}} suggestion didn't make sense to me. --MZMcBride (talk) 00:01, 23 May 2010 (UTC)[reply]

Please get Wikipedia:Database reports/File description pages containing no templates and/or Wikipedia:Database reports/File description pages containing no templates or categories up and running again, and then remove the File: namespace from Wikipedia:Database reports/Blank single-author pages. There are too many files which were uploaded with empty description pages, and that has nothing to do with CSD G7. עוד מישהו Od Mishehu 10:41, 18 March 2010 (UTC)[reply]

The broken reports are noted above. I just filed another bug in Wikimedia's bug tracker (bugzilla:22883) to accompany the Toolserver bug report (jira:TS-549). I could probably rewrite these reports to use a different method of comparing against Commons, but I'd really rather see this bug resolved properly (setting the data types as varchar or varbinary consistently). I'm not sure if this is likely to happen in the near future, though, so I may be forced to rewrite these reports in the next few weeks.
As for Blank single-author pages (configuration), I don't think the File pages are really overwhelming the report, though if they're being listed elsewhere, I don't have an issue removing them. I'll do it if/when the other reports are fixed. --MZMcBride (talk) 15:05, 18 March 2010 (UTC)[reply]

Recently created pages with deleted titles

I think we could use an automatically generated list of recently created (30 days would make sense, based on Special:NewPages) pages, whose author isn't an admin, which have a deletion log entry and no undeletion entry. Pages like this are likely recreations of the original pages, which would frequently be speedy deletable (if the original was a speedy deletion, then under the same CSD; if the original was deleted per XFD, then CSD G4). Weekly seems like a good frequency for this one. עוד מישהו Od Mishehu 09:36, 14 April 2010 (UTC)[reply]

Example page? --MZMcBride (talk) 16:21, 14 April 2010 (UTC)[reply]
I think this would be a good idea, except I wouldn't necessarily limit it to those with no undeletion history. Some pages have a history of deletion/restoration only to be finally settled by an XfD resulting in delete. If such a category were recreated, filtering out ones with a restoration in history wouldn't catch this. An example of one that would fit Od Mishehu's original suggestion would be Category:Wikipedian cyclists, and an example of one that wouldn't be caught would be Category:Socialist Wikipedians. VegaDark (talk) 16:30, 14 April 2010 (UTC)[reply]

I looked at pages that were new (in the recentchanges table, which goes back about 30 days) and had a deletion entry. There were about 5,000 results total. I put 100 results below for analysis. These are only articles, but there are IP user talk pages, file pages, and other types of pages in the results as well. --MZMcBride (talk) 23:42, 14 April 2010 (UTC)[reply]

Extended content
mysql> SELECT
    ->   ns_name,
    ->   page_title
    -> FROM page
    -> JOIN toolserver.namespace
    -> ON dbname = 'enwiki_p'
    -> AND page_namespace = ns_id
    -> JOIN recentchanges
    -> ON rc_cur_id = page_id
    -> WHERE rc_new = 1
    -> AND page_is_redirect = 0
    -> AND EXISTS (SELECT
    ->               1
    ->             FROM logging_ts_alternative
    ->             WHERE log_title = page_title
    ->             AND log_namespace = page_namespace
    ->             AND log_type = 'delete'
    ->             AND log_action = 'delete')
    -> LIMIT 100;
+---------+-----------------------------------------------+
| ns_name | page_title                                    |
+---------+-----------------------------------------------+
|         | Kim_Hak-Sung                                  | 
|         | Atlassian                                     | 
|         | Philadelphia_military_academy_at_leeds        | 
|         | Payless_Cashways                              | 
|         | Big_Scoob                                     | 
|         | Michael_Stevenson                             | 
|         | Manikura_Tikinau                              | 
|         | Andreu_Lacondeguy                             | 
|         | Youth_Outreach                                | 
|         | Nate_Lowman                                   | 
|         | Tatrapan                                      | 
|         | Papa_Niang                                    | 
|         | Webberbus                                     | 
|         | Muhammad_Akram_Khan                           | 
|         | Land_of_Clover                                | 
|         | The_Rat_Look                                  | 
|         | New_Zealand_Top_50_Singles_of_2001            | 
|         | Nazneen_Contractor                            | 
|         | Nerm                                          | 
|         | EUscreen                                      | 
|         | Strong_House                                  | 
|         | L.L.Bean_SIGNATURE                            | 
|         | Michael_Barlow                                | 
|         | Sergio_Petrelli                               | 
|         | Theo_Blake                                    | 
|         | ClearTrial                                    | 
|         | Lynda                                         | 
|         | Paul_Duhart                                   | 
|         | Multidark                                     | 
|         | Break_The_Chair                               | 
|         | Chrysothrix                                   | 
|         | Ikonika                                       | 
|         | Casey_Clausen                                 | 
|         | Denodo                                        | 
|         | The_Original_Rabbit_Foot_Spasm_Band           | 
|         | Emily_Ratajkowski                             | 
|         | VMAC                                          | 
|         | Virtu-oso                                     | 
|         | Stade_Klébert_Picard                          | 
|         | Soldier_Support_Institute                     | 
|         | Vladimir_Cuk                                  | 
|         | Supermale_(novel)                             | 
|         | Uttoxeter,_Ontario                            | 
|         | Alexander_First_Nation                        | 
|         | Brothers_forever                              | 
|         | Supreme_Court_of_the_United_States,_televised | 
|         | Penelope_Gross                                | 
|         | Craig_Lowe                                    | 
|         | Ganci                                         | 
|         | Mann_(rapper)                                 | 
|         | University_of_Connecticut_Marching_Band       | 
|         | Radmila_Gorjanovic                            | 
|         | Royal_Lavanderia                              | 
|         | Mary_Wheeler                                  | 
|         | Taped_With_Rabbi_Doug                         | 
|         | Jippii                                        | 
|         | Crystal_King_(band)                           | 
|         | Globe_(album)                                 | 
|         | WhizBase                                      | 
|         | Moynat                                        | 
|         | Hailesa                                       | 
|         | Sunaila                                       | 
|         | Rainbow_Entertainment                         | 
|         | Allyn_River                                   | 
|         | Richard_Jay                                   | 
|         | Nirosha                                       | 
|         | Arve_Isdal                                    | 
|         | Lev_Kuznetsov                                 | 
|         | MindMeister                                   | 
|         | Periphery_(band)                              | 
|         | Age_Concern_Cymru_and_Help_the_Aged_in_Wales  | 
|         | Jorge_Gestoso                                 | 
|         | DJ_Swamp                                      | 
|         | Robert_Babicz                                 | 
|         | Danny_Rhymes                                  | 
|         | Cider_Bill_of_1763                            | 
|         | Andrew_Garcia                                 | 
|         | Blockdot                                      | 
|         | Merry_Tales                                   | 
|         | PMS_Asterisk                                  | 
|         | Lourdes_grotto                                | 
|         | Keith_Norman                                  | 
|         | Philipose_Vaidyar                             | 
|         | Kyle_Rudolph                                  | 
|         | Celebrity_Skin_(band)                         | 
|         | Pseudochazara                                 | 
|         | Huw_Jones                                     | 
|         | Jon_Nolan                                     | 
|         | Murray_Carter                                 | 
|         | Jonathan_Hall                                 | 
|         | Palaemonoidea                                 | 
|         | Iwebkit                                       | 
|         | The_So_So_Glos                                | 
|         | Karamoko_Koné                                 | 
|         | ITVFest_(Independent_Television_Festival)     | 
|         | Eddie_Lee_Sexton                              | 
|         | The_Case_of_the_Disappearing_Diamonds         | 
|         | Google_TV                                     | 
|         | Marooned_(band)                               | 
|         | Allen_Frizzell                                | 
+---------+-----------------------------------------------+
100 rows in set (0.06 sec)
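Note that the transcript above doesn't yet apply the "no undeletion entry" part of the request. A sketch of the extra condition that could be ANDed into the WHERE clause (undeletions are logged with log_action = 'restore'); per VegaDark's comment, whether this exclusion is actually wanted is a separate question:

-- additional condition for the query above: skip titles that have ever been undeleted
AND NOT EXISTS (SELECT 1
                FROM logging_ts_alternative
                WHERE log_title = page_title
                AND log_namespace = page_namespace
                AND log_type = 'delete'
                AND log_action = 'restore')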

Incomplete work

--MZMcBride (talk) 15:24, 3 May 2010 (UTC)[reply]

With regard to the second report, QCoder said here that there were some false positives, but he seems to enjoy the list nonetheless. I pointed him over here and asked if he could produce some common keywords for the false positives to make the report better. Killiondude (talk) 18:09, 3 May 2010 (UTC)[reply]
Non-free... is resolved. Still need to work on (5) quite a bit. --MZMcBride (talk) 03:54, 9 May 2010 (UTC)[reply]

Categories linking to deleted categories

I think we could use a weekly report for categories linking to deleted categories. These are frequently the result of back-links to renamed categories not being fixed. עוד מישהו Od Mishehu 12:39, 23 May 2010 (UTC)[reply]

There are Categories categorized in red-linked categories (configuration) and Deleted red-linked categories (configuration) already. Not sure what you want. --MZMcBride (talk) 13:52, 23 May 2010 (UTC)[reply]
Not the same thing. I'm talking about a "See also Category:Foo", or a navbox for categories, which isn't necessarily updated when the category is moved. עוד מישהו Od Mishehu 04:43, 24 May 2010 (UTC)[reply]
Ah, so you're talking about Categories containing deleted red-linked categories, pretty much. (There's a bit of ambiguity in that report title, but oh well.) I posted a list here. The results look pretty noisy. Some results appear to be related to {{#ifexist:}} parser magic voodoo. Or perhaps just bad rows in the pagelinks table, though that's less likely. Let me know what you think of the results. --MZMcBride (talk) 05:21, 24 May 2010 (UTC)[reply]
I think we should exclude any case where the deletion reason contains the string "C1" (the CSD for empty categories), which seems to be filling up the report. עוד מישהו Od Mishehu 10:51, 26 May 2010 (UTC)[reply]
Updated results. This report idea seems untenable. --MZMcBride (talk) 22:03, 26 May 2010 (UTC)[reply]
This does contain some useable information (see this edit, where I fixed a link thanks to your report). To make this more useful, I think we should remove categories transcluding {{Progress box}} (these are maintenance categories, where the links don't need to be updated), and categories transcluding {{cat class}} (these are WikiProject categories, where some of the corresponding levels have been deleted). עוד מישהו Od Mishehu 08:18, 27 May 2010 (UTC)[reply]
And any transcluding {{Cfr full}}, which should catch categories being proposed for renaming to previously deleted titles. עוד מישהו Od Mishehu 08:30, 27 May 2010 (UTC)[reply]

Sorry, I seem to have missed your replies here (or I forgot about them). Are you still interested in getting an updated report with the new exclusions? --MZMcBride (talk) 05:01, 13 July 2010 (UTC)[reply]

Yes. עוד מישהו Od Mishehu 10:07, 13 July 2010 (UTC)[reply]
Here are the results with the three specified templates excluded: http://en.wikipedia.org/w/index.php?oldid=373363263
By the way, you should ping me on my talk page if I don't respond to a thread for a few weeks. --MZMcBride (talk) 01:30, 14 July 2010 (UTC)[reply]
Update: I would like it if the list would:
  1. include cases where the target is a category redirect, regardless of the existence of a deletion log entry for such pages;
  2. exclude all cases where the target begins with a digit, as there are too many false positives in that section to make it useful;
  3. exclude all source categories which transclude {{Articles by Quality}}, for the same reason as {{cat class}}.
עוד מישהו Od Mishehu 08:38, 29 July 2010 (UTC)[reply]
I'm confused now. This most recent request seems to have veered off of the original request. We haven't been looking at category redirects (or pseudo-redirects, as the case may be with {{category redirect}}). I'm not sure how I can modify the current query to accommodate the new request. It seems like it would have to go into a new list (with the exception of point 2, I suppose). --MZMcBride (talk) 01:37, 30 July 2010 (UTC)[reply]
Category redirects are almost always categories which have been renamed, but at least one user thinks that the old name is one people might be looking for. As such, from a human perspective, it's similar to the main thing I'm looking for - links to categories which have been renamed, where the links haven't been updated. If you need to do it in a separate report, feel free to do so.
Points 2 and 3 are consistent with my earlier request - point 2 removes a practically useless part of the list; point 3 removes a class of categories which, while they technically fit the original request, don't need anything done with them. עוד מישהו Od Mishehu 07:31, 30 July 2010 (UTC)

I need to chart this, because I don't remember what the original request was or what I wrote and rewrote.

Here's what I have so far:

  • pages in the category namespace
  • that don't transclude {{progress box}}, {{cat class}}, and {{cfr full}}
  • that include links to deleted categories
  • where the deleted categories' deletion reason isn't like ".*([Cc]1|[Ee]mpty|Datadump).*"

Now you want to:

  • exclude pages beginning with a digit
  • add {{Articles by Quality}} to the list of templates whose transcluding pages are excluded

These two pieces are easy enough. The part about category redirects is confusing the shit out of me, so you'll need to copy and paste from this list and re-explain what exactly you're after with those. It'll have to be in a separate report, I think. --MZMcBride (talk) 02:54, 31 July 2010 (UTC)[reply]
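Pulling that checklist together, a sketch of roughly what the combined query could look like; the table usage and the stored template titles are assumptions, and this is not the query used for the results linked below:

SELECT DISTINCT src.page_title
FROM page AS src
JOIN pagelinks AS pl
  ON pl.pl_from = src.page_id
WHERE src.page_namespace = 14          -- source is a category page
AND pl.pl_namespace = 14               -- ...linking to another category
AND pl.pl_title NOT REGEXP '^[0-9]'    -- new: skip targets beginning with a digit
-- the target is a red link...
AND NOT EXISTS (SELECT 1
                FROM page AS target
                WHERE target.page_namespace = 14
                AND target.page_title = pl.pl_title)
-- ...with a deletion entry that doesn't match the C1/empty/Datadump pattern
AND EXISTS (SELECT 1
            FROM logging_ts_alternative
            WHERE log_namespace = 14
            AND log_title = pl.pl_title
            AND log_type = 'delete'
            AND log_action = 'delete'
            AND log_comment NOT REGEXP '[Cc]1|[Ee]mpty|Datadump')
-- skip the excluded maintenance/WikiProject/CfR source categories
AND NOT EXISTS (SELECT 1
                FROM templatelinks
                WHERE tl_from = src.page_id
                AND tl_namespace = 10
                AND tl_title IN ('Progress_box', 'Cat_class', 'Cfr_full', 'Articles_by_Quality'))
LIMIT 200;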

Here you go: http://en.wikipedia.org/w/index.php?oldid=376370319 --MZMcBride (talk) 04:00, 31 July 2010 (UTC)[reply]

On further thought, I think that links to category redirects may need to be a different report. What I want there is any case where:
  1. Category:A links to category:B
  2. Category:B is a category redirect (defined as containing a {{Category redirect}} template), or a category disambiguation (defined as containing a {{Category ambiguous}} template).
עוד מישהו Od Mishehu 10:12, 1 August 2010 (UTC)[reply]

Actually a red-link in a progress box means that the category should be created, so there's still an action required. However I seem to remember using AWB's red-links function to try and pick these up, and getting the empty red-linked categories which {{Progress box}} knows about but is not showing, as well as the ones I wanted. Rich Farmbrough, 09:22, 15 September 2010 (UTC).[reply]

usernames vs articles

Would there be any way to locate instances where both (X Y) and (user:X Y) exist, and (X Y) is a living person? DS (talk) 14:21, 7 June 2010 (UTC)[reply]

I did fuzzy matching between article creators and article titles at some point. The data is available here. It'd be neat if something similar could be done with usernames and BLP titles, though you're dealing with much larger sets, so I'm not sure how technically feasible it is.
In general, there are two important considerations to make with a task like this: false positives and false negatives. The false positives come from common names (most are), causing coincidental collisions that aren't of any importance. The false negatives come from two sources that I can think of off-hand: the addition of middle names or inclusion of full names in page titles and page titles that disambiguate (e.g. "John Smith (minister)"). The middle name issue is offset by fuzzy matching; the disambiguation issue is offset by stripping. --MZMcBride (talk) 18:25, 7 June 2010 (UTC)[reply]
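A minimal sketch of the simplest version of this - exact collisions between article titles in Category:Living people and existing usernames - which sidesteps the fuzzy matching and disambiguator stripping described above (and so misses exactly the false negatives mentioned):

SELECT p.page_title
FROM page AS p
JOIN categorylinks
  ON cl_from = p.page_id
  AND cl_to = 'Living_people'
JOIN user
  ON user_name = REPLACE(p.page_title, '_', ' ')  -- titles store underscores, usernames store spaces
WHERE p.page_namespace = 0
LIMIT 100;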
If those users did not edit articles about themselves, then identifying them can raise privacy issues. Sole Soul (talk) 22:41, 7 June 2010 (UTC)[reply]
How so? --MZMcBride (talk) 00:33, 8 June 2010 (UTC)[reply]

Link to the userrights screen would be keen. Filling in the reason as 'autoconfirmed' even moreso. –xenotalk 01:28, 14 June 2010 (UTC)[reply]

I've changed the report to use {{dbr link}}, allowing any user to customize the output. Other reports will hopefully eventually use this template as well. --MZMcBride (talk) 15:28, 16 June 2010 (UTC)[reply]
Awesomeness. –xenotalk 15:31, 16 June 2010 (UTC)[reply]

User subpages for users indef blocked for spamming

There are many users that have been indefinitely blocked for spamming, for using Wikipedia for promotional purposes, or for having a promotional username. Quite often these users have subpages in their userspace that contain promotional material that hasn't been deleted. This material can show up in Google searches - which is probably why spammers put it there. Would it be technically possible to have a database report that lists these pages? If so, would others find this useful enough to be worth the effort? Peacock (talk) 17:46, 23 June 2010 (UTC)[reply]

Probably worth the effort. Do you have an example subpage (that hasn't been deleted) so that I can verify the results of any queries? --MZMcBride (talk) 17:53, 23 June 2010 (UTC)[reply]
Here's one I came across yesterday and just blanked: User:Juntaomotors/Juntaomotors. Peacock (talk) 15:43, 25 June 2010 (UTC)[reply]
That user was not blocked for spam, so it's not really a helpful example. I wrote the query below, but for some reason it does not work. Perhaps MZMcBride will be clever enough to write a better query. Tim1357 talk 01:18, 26 June 2010 (UTC)[reply]
The Non-Functioning Query
SELECT Concat('User:', page_title) 
FROM   page 
       JOIN logging
         ON log_title = Substring_index(page_title, '/', 1)
            AND log_namespace = 2
WHERE  page_namespace = 2 
       AND page_title LIKE '%/%'
       AND log_action = 'block'
       AND log_comment LIKE '%spam%'
LIMIT  1;

I assumed PCock wanted current blocks, not all block actions, so I checked against the ipblocks table. It's an insanely slow query, though. Might be better to do it in sets in a programming language. I imagine you'd want to check the ipb_reason field for a few strings like "usernamehardblocked", "spam", etc. --MZMcBride (talk) 02:10, 26 June 2010 (UTC)[reply]

Working, but very slow, query
SELECT
  page_namespace,
  page_title,
  ipb_reason
FROM page
JOIN ipblocks
ON ipb_address = TRIM(SUBSTRING_INDEX(REPLACE(page_title, '_', ' '), '/', 1))
AND page_namespace IN (2,3)
WHERE ipb_expiry = 'infinity'
AND page_title LIKE '%/%'
LIMIT 1;

Hmm. I think this is one query that is not going to be done gracefully. I ran a new query that is set to save here when it finishes. (Note each page title lacks the 'User:' prefix). I'll check back tomorrow to see if the query had any success, otherwise I am out of ideas. Tim1357 talk 02:36, 26 June 2010 (UTC)[reply]

New Query
CREATE TEMPORARY TABLE u_tim1357.blocked_users 
  (
     USER VARBINARY(255)
  );

INSERT INTO u_tim1357.blocked_users 
SELECT DISTINCT log_title 
FROM   logging 
WHERE  log_action = 'block'
       AND log_comment LIKE '%spam%'
ORDER  BY log_timestamp DESC; 

SELECT page_title 
FROM   page 
       JOIN u_tim1357.blocked_users
         ON page_title = Substring_index(USER, '/', 1)
WHERE  page_title LIKE '%/%' 
       AND page_namespace = 2;
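One thing worth noting about the join above: page_title still contains the slash while the stored username does not, so page_title = Substring_index(USER, '/', 1) can never hold at the same time as page_title LIKE '%/%', and the query would return no rows. A sketch with the prefix taken from the page title instead (untested assumption):

SELECT page_title
FROM   page
       JOIN u_tim1357.blocked_users
         -- compare the blocked username against the part of the title before the slash
         ON Substring_index(page_title, '/', 1) = USER
WHERE  page_title LIKE '%/%'
       AND page_namespace = 2;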
I really hope you were using the logging_ts_alternative table instead of logging when you were running those queries on the Toolserver. — Dispenser 14:56, 6 July 2010 (UTC)
Uh oh. Did I break anything? Nobody told me about that table! Tim1357 talk 23:08, 7 July 2010 (UTC)[reply]
Anyways, I bypassed the logging table altogether and just used ipblocks. I limited the query to the first 200 results, which will be automatically saved here when the query completes. --Tim1357 talk 02:35, 7 August 2010 (UTC)
Guess not. --MZMcBride (talk) 05:54, 9 August 2010 (UTC)[reply]
Yea, dunno what happened there. Tim1357 talk 02:41, 17 August 2010 (UTC)[reply]

Suggestion: sort by "time added"

This is not a request, it is a suggestion concerning all applicable reports. The suggestion is to add a column which marks when an item was first added to the report.

The rationale: If an item has been reported in the last 10 updates of a daily report, chances are that more than one user has checked the item.

If this suggestion was accepted, there is no need to apply it retroactively for the current items. Sole Soul (talk) 03:19, 7 July 2010 (UTC)[reply]

I can't see any simple or clean way of doing this right now. I'll have to give this more thought. In general, I'd say that better reports are preferred to just adding a timestamp column. That is, if certain results keep showing up and are getting in the way, it'd be good to find a way to make that not happen rather than simply point out to the user how long these items have been getting in the way. Is there a particular report or set of reports where you think this feature would be useful? --MZMcBride (talk) 04:45, 13 July 2010 (UTC)[reply]
Thank you. My impression is that this is a bot function, not a database query function; am I right? Sole Soul (talk) 05:51, 13 July 2010 (UTC)
I'm not sure I understand the question. The MediaWiki database doesn't track when items are added to a page. It would have to be stored in a separate table somewhere, I imagine. --MZMcBride (talk) 05:52, 13 July 2010 (UTC)[reply]
When I made my suggestion I presumed that it would be done by programming BernsteinBot to compare the last update of a report with the new update, and I just asked to know if my understanding was correct or not. Sole Soul (talk) 06:05, 13 July 2010 (UTC)[reply]
Yes, it would do something like that. BernsteinBot would probably maintain a database in which it stored the database report title, a key of some sort (maybe a cryptographic hash of the table row), and a timestamp. Then the bot would compare any updates to the database table and update the database table or wiki table as necessary. Or something like that. It's not impossible to code, but it's not a trivial task and I'm not sure I see the benefit (yet). --MZMcBride (talk) 06:08, 13 July 2010 (UTC)
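To make that bookkeeping concrete, a sketch of the kind of side table the bot could keep; every name here is an assumption for illustration, not something BernsteinBot actually uses:

-- one row per (report, table row) pair, remembering when the row was first seen
CREATE TABLE report_row_first_seen (
  report_title VARBINARY(255) NOT NULL,
  row_hash     VARBINARY(40)  NOT NULL,   -- e.g. SHA-1 of the rendered table row
  first_seen   VARBINARY(14)  NOT NULL,   -- MediaWiki-style YYYYMMDDHHMMSS timestamp
  PRIMARY KEY (report_title, row_hash)
);

-- on each report update: new rows get a timestamp, rows already present keep theirs
INSERT IGNORE INTO report_row_first_seen (report_title, row_hash, first_seen)
VALUES ('Long pages',
        SHA1('...one rendered table row...'),
        DATE_FORMAT(NOW(), '%Y%m%d%H%i%s'));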
Users go through a large report either from the top down, the bottom up, or randomly. With time, this will make the bulk of the top and bottom items stabilize (i.e. they will be repeated because many users have checked them). This means that a small percentage of items are checked many times while the rest are checked a few times or never. Sole Soul (talk) 06:37, 13 July 2010 (UTC)

More long pages

Is it possible to get more long talk pages for Wikipedia:Database reports/Long pages in section "Specified talk pages"? It would be appreciated, since the list is good for checking for pages that need to be archived and reduced in size. What about changing the size limit from 175,000 bytes to 150,000 bytes? --Kslotte (talk) 16:06, 12 July 2010 (UTC)

Sure, done. --MZMcBride (talk) 04:44, 13 July 2010 (UTC)[reply]
Thanks, 300+ entries found. --Kslotte (talk) 10:58, 14 July 2010 (UTC)[reply]
Could you also exclude pages that transclude {{Historical}}? Tim1357 talk 10:26, 16 July 2010 (UTC)
I don't see why we should exclude those. If it is about getting some Wikipedia talk "permanent archives" excluded, then excluding the historical template alone isn't enough; many other templates exist as well. --Kslotte (talk) 10:52, 16 July 2010 (UTC)
Sounds a bit like "perfect is the enemy of done." If we can say definitively that pages marked with {{historical}} should be excluded, then there's no reason to list them, even if we may miss other pages that aren't properly marked or are marked using a different template. --MZMcBride (talk) 01:31, 18 July 2010 (UTC)[reply]

Perhaps exclude all noindexed pages. Most of the permanent archive templates dump the pages into the noindex category. Tim1357 talk 01:01, 18 July 2010 (UTC)

I don't know about this. A lot of pages get {{NOINDEX}}'d for a lot of reasons. --MZMcBride (talk) 01:31, 18 July 2010 (UTC)[reply]

The list has been reduced. It is time to take the next step and lower the size limit to 140K for Wikipedia:Database reports/Long pages in section "Specified talk pages". --Kslotte (talk) 09:42, 28 August 2010 (UTC)

And could {{pagelinks}} be used instead of {{pler}}? What is the idea with the raw format? Do we need it? --Kslotte (talk) 22:55, 29 August 2010 (UTC)
Changed the limit per your request. I also set the report to use {{dbr link}}, so you can customize the link output to your heart's desire. --MZMcBride (talk) 01:12, 30 August 2010 (UTC)[reply]
I did this change, making use of {{pagelinks}} and an additional link to the raw version. --Kslotte (talk) 17:21, 30 August 2010 (UTC)[reply]
I reverted it. It seems there are too many transclusions on one page; links at the end of the page won't work, possibly due to some wiki function restrictions. --Kslotte (talk) 18:31, 20 September 2010 (UTC)

The list has been further reduced. It is time to reduce the size limit to 130K for Wikipedia:Database reports/Long pages in section "Specified talk pages". --Kslotte (talk) 21:17, 21 October 2010 (UTC)

Atypical deletion log actions (configuration) has been disabled. When RevDelete was only enabled for oversighters, this report made a lot more sense. Now it has simply become unwieldy and ineffective. bugzilla:17293 would address this issue in MediaWiki itself, so write your Congressmen about getting that fixed. --MZMcBride (talk) 04:59, 13 July 2010 (UTC)[reply]

Can we get a petition going? Killiondude (talk) 19:42, 18 July 2010 (UTC)[reply]
With God, all things are possible. --MZMcBride (talk) 20:26, 18 July 2010 (UTC)[reply]

Most-linked redirects

Would it be possible to have a database of redirects that have the most links to them in articles? For example, finding out how many articles link to Dusty Rhodes (baseball player) when those should all link to Dusty Rhodes (baseball). Running it weekly would work fine. Wizardman Operation Big Bear 16:59, 17 July 2010 (UTC)[reply]

The pagelinks table is one of the largest. As a basis for comparison, all non-deleted revisions equal roughly 328,876,236 rows. The pagelinks table is currently roughly 507,145,981 rows. So any queries on it generally don't work so well. I'm testing the query to get this data now, but it's been running for a few hours and I'm not sure if it will finish before it gets killed. We'll see. --MZMcBride (talk) 20:58, 17 July 2010 (UTC)[reply]
Diddo MZMcBride. Finding things like the most-linked redirects requires a query that takes friggin forever to complete. This is because it needs to find how many links exist for every single redirect before it can spit out a list of the most linked. I suggest a list of redirects with over a certain amount of links (say, 500 links), limited to the first 100 found. Tim1357 talk 01:00, 18 July 2010 (UTC)
You mean ditto, not diddo. --MZMcBride (talk) 17:11, 18 July 2010 (UTC)[reply]
Tim1357 talk 18:46, 18 July 2010 (UTC)[reply]
So, I ran this query, but it took about 20 hours. And the result set wasn't ordered correctly, so I have to run it again. I'm not sure I see the need for this to be a proper database report. I may just throw it in a sandbox. Dunno. --MZMcBride (talk) 17:11, 18 July 2010 (UTC)[reply]
20 hours seems like a bit much for such a low priority type of report. Maybe just have this be a "as requested" report. Tim1357 talk 18:46, 18 July 2010 (UTC)[reply]
I tried it without any filters and after 3 hours I got a 60 MB file with 2,566,221 rows. However, I'm unsure how to filter the set. Plus we don't fix redirects. — Dispenser 23:21, 21 July 2010 (UTC)[reply]
3 hours? Can you head or tail the file and post results somewhere? I'm also not sure what "filter the set" means. Maybe paste the query you used? :P --MZMcBride (talk) 01:12, 22 July 2010 (UTC)[reply]
I looked at the query you used. I think you missed the "in articles" part. Or maybe I mis-read the request. It seems like he only wants to count links from other articles, though. --MZMcBride (talk) 11:10, 22 July 2010 (UTC)[reply]
It takes 5.2 hours searching only the article namespace, but only 15 minutes when both the page and redirect ends with ')'. However, the requester hasn't come back to clarify what he's going to do with it. — Dispenser 03:10, 29 July 2010 (UTC)[reply]
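For the record, a sketch of Tim1357's trimmed-down variant - redirects with more than some threshold of incoming links, counting only links from articles; the threshold and LIMIT are illustrative, and even in this form the query would still be very slow, consistent with the run times reported above:

SELECT rd.page_title,
       COUNT(*) AS incoming_links
FROM page AS rd
JOIN pagelinks AS pl
  ON pl.pl_namespace = 0
  AND pl.pl_title = rd.page_title
JOIN page AS src
  ON src.page_id = pl.pl_from
  AND src.page_namespace = 0        -- count links from articles only
WHERE rd.page_namespace = 0
AND rd.page_is_redirect = 1
GROUP BY rd.page_title
HAVING COUNT(*) > 500
LIMIT 100;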

Images without categories

Would it be possible to have a request run to find images with no categories whatsoever, including categories from templates? In theory, every image should be categorised based on its licensing; if it's not, it'll fail NFCC#10b, as all the copyright tags include licensing categories (or should). This should help us find problem images (or perhaps even problem templates) that the bots and other checks are missing. I don't know the scale of this problem, but I know that it exists. I guess having it checked weekly would be good. J Milburn (talk) 02:42, 27 July 2010 (UTC)[reply]

In case it wasn't clear, I was thinking about images hosted on enwp, not images hosted on Commons that happen to have some details on pages here. Can't see any harm in including the latter, but just thought I'd note that. J Milburn (talk) 02:45, 27 July 2010 (UTC)[reply]
Special:UncategorizedFiles? --MZMcBride (talk) 01:56, 28 July 2010 (UTC)[reply]
Well, I'll be damned. Thank you muchly. J Milburn (talk) 12:04, 28 July 2010 (UTC)[reply]
Interestingly, it didn't seem to catch the images I was after, which is odd. Very useful though, thanks again. J Milburn (talk) 12:14, 29 July 2010 (UTC)[reply]

Redirects

I created Wikipedia:Pages with most redirects. Don't know if it belongs on this page. Also working on Wikipedia:Templates with names differing only in capitalization. Rich Farmbrough, 13:20, 28 July 2010 (UTC).[reply]

If you plan on updating them at somewhat frequent intervals, please add them to the subject-space page. --MZMcBride (talk) 16:54, 28 July 2010 (UTC)[reply]

Interwiki reports

Would it be worth producing a report on problematic interwiki links? I have recently come across the following errors:

  • Asymmetric links, i.e. article A points to article B, but the en interwiki on the latter doesn't point back to article A.
  • Interwiki links to deleted articles. (These don't appear red like internal links do).
  • Interwiki links to redirects.

Now the third one may not really be a problem if we obey the rule that we don't fix redirects but the other two are definitely worth fixing. Boissière (talk) 20:47, 28 July 2010 (UTC)[reply]

There are lots of bots that fix the first error. The second one might be worth looking into, if no bot does this already. Also, I think a bot that immediately fixes such errors would be better than just a report. Svick (talk) 21:23, 28 July 2010 (UTC)[reply]
Bot requests are a bit different from database reports. Though obviously there is some overlap, as some reports are fed into bots and scripts to automatically or semi-automatically fix issues in pages.
Off-hand, I'm inclined to say that listing interwikis to redirects isn't going to be helpful and would create the illusion that they should be fixed. I don't see any reason for the redirects to be bypassed unless the target page isn't accurate (the same title being re-purposed for a disambiguation page or whatever), obviously. I don't think there's really a way for a bot to know that, though.
I agree with Svick regarding points 1 and 2. Bots either exist (such as pywikipedia's interwiki.py, Google it for more info) or could exist to do some of this work. I imagine most of these bots would generate their own lists, though I'd be happy to help if I can. --MZMcBride (talk) 23:31, 28 July 2010 (UTC)[reply]
OK, I would agree that number 3 is not necessary and number 2 is fairly easy for bots to handle (though it would be interesting to know how good a job they are doing). What I can't see in the case of number 1 is how a bot could work out which link was faulty. I would have thought it would require some examination of the articles in question to decide where the links should actually go. How easy would it be to knock up a query for number 1? If it is not too hard then we could see what the current situation is. Boissière (talk) 20:07, 29 July 2010 (UTC)[reply]
I have a tool, tools:~merl/reverselanglinks, which is used by a gadget on dewiki to search interwikis linking to a single article. Because of these experiences I think the output of this query would be very long.
If you create this report you also have to keep in mind that jbowiki is not first-letter-upper-case like all the other Wikipedias. Merlissimo 20:46, 29 July 2010 (UTC)
Your comment about the output being long does imply that it is something that a bot cannot easily fix. It has occurred to me that a database report might not be the best approach as presumably you have to attach to each language wiki's own database. I am now wondering if using the API would be easier and therefore a reporting (but not fixing) bot would be better. Boissière (talk) 15:31, 2 August 2010 (UTC)[reply]

Is there any reason why that report hasn't been updated for nearly three months? Does it have anything to do with server load? I for one find that report interesting. I wonder if pending changes has had much of an effect on the data. Graham87 09:16, 29 July 2010 (UTC)[reply]

It's supposed to run on the 28th day of each month. It keeps getting killed because it runs for so long, I think. I'm not really sure if there's anything to be done about this right now. --MZMcBride (talk) 02:44, 31 July 2010 (UTC)[reply]
"data as of 18:11, 28 August 2010 (UTC)" — wheeeee! --MZMcBride (talk) 02:26, 30 August 2010 (UTC)[reply]

Archived page

I've (selectively) archived this talk page. I was going to write a script to do it, but it's not worth the development time when it only takes a few minutes to do it manually.

Some of the threads I removed were either about reports I didn't really feel like generating or ones where the original poster hadn't responded in a while.

If you actually want a report, re-post here and someone will take care of it, eventually. You may have to nag, though. --MZMcBride (talk) 02:41, 31 July 2010 (UTC)[reply]

Subpages of Wikipedia:Articles for deletion not transcluded to a daily log page

To find AFD discussions not properly transcluded. Maybe limit to recently created to eliminate noise. –xenotalk 15:38, 1 August 2010 (UTC)[reply]

I assume you're already aware of Old deletion discussions (configuration) and Orphaned article deletion discussions (configuration). This is a slight variant, I guess. I'm not sure of a proper report title. I thought you were going to get a Toolserver account and start doing these tasks yourself? :P --MZMcBride (talk) 17:45, 1 August 2010 (UTC)[reply]
Untranscluded deletion discussions. As to your question... Kindasorta. But I know my limits! –xenotalk 18:21, 3 August 2010 (UTC)[reply]
"Untranscluded article deletion discussions", you mean? --MZMcBride (talk) 22:33, 3 August 2010 (UTC)[reply]
Quite. I can't do all the work, you see. [Though expanding the report to MFD discussions might be worthwhile as well]xenotalk 22:34, 3 August 2010 (UTC)[reply]
Okay, so we'll do "Untranscluded deletion discussions" and include MFD and AFD then, I guess. You want this updated daily? --MZMcBride (talk) 16:44, 14 August 2010 (UTC)[reply]

births and births

Been finding a lot of biographies of dead people mistakenly put in two births categories (par exemple), could we get a report of people in two separate births categories? --Closedmouth (talk) 13:30, 14 August 2010 (UTC)[reply]

Make me! --MZMcBride (talk) 18:20, 14 August 2010 (UTC)[reply]
Here you go: http://en.wikipedia.org/w/index.php?oldid=378928452
If there isn't one already, you should create a (maintenance or talk page) category for biographies that cover more than one individual. It would allow for much easier filtering in the future, for this list and other lists. Once that's in place, I'll write up a report that regularly updates. I think this is a reasonable compromise. --MZMcBride (talk) 20:27, 14 August 2010 (UTC)[reply]
If you change the WHERE from WHERE c1.cl_to != c2.cl_to to WHERE c1.cl_to < c2.cl_to it would stop the double rows. -- WOSlinker (talk) 20:48, 14 August 2010 (UTC)[reply]
Aye, nice. :-) I think SELECT DISTINCT would also do it. I was just too lazy to re-run the query. --MZMcBride (talk) 20:49, 14 August 2010 (UTC)[reply]

File description pages containing no license

Been mulling this over for a while; similar to Wikipedia:Database reports/File description pages containing no templates, I'm looking for "File description pages containing no license" - but the actual finding of them...? Theoretically each image description page would include more than one template - i.e. Logo & FUR - which would probably lead to tons of false positives. Another option would be to scan the image page to see if any of the templates from Category:Wikipedia image copyright templates is present, but it's a huge category. Another possibility - checking for "very short" page length instead of "no" template at all... Suggestions would be greatly appreciated :) Skier Dude (talk 07:21, 19 August 2010 (UTC)[reply]

tools:~erwin85/shortpages.php seems to be what you're looking for. It allows you to find pages in the file namespace that don't transclude any templates. Killiondude (talk) 16:26, 19 August 2010 (UTC)[reply]

Thanks!! Not only does it catch my issue - also catches the description pages for commons images as well! Skier Dude (talk 04:07, 21 August 2010 (UTC)[reply]

Changed my mind - it's picking up only the short pages - that means that if there's one template (say, the basic info) it isn't included by the tool. So, my request for "very short pages" isn't the right answer for this :( Skier Dude (talk 04:17, 21 August 2010 (UTC)[reply]
You've seen Non-free_files_missing_a_rationale (configuration), right?
If that report isn't really what you want, you should find some specific example files that are problematic. I can take a look at them and see if some patterns are evident. --MZMcBride (talk) 04:31, 21 August 2010 (UTC)[reply]
The tagged ones are in Category:Wikipedia files with unknown copyright status - I've added a goodly number there ;)Skier Dude (talk 03:29, 26 August 2010 (UTC)[reply]

Unlisted stub types

I think a useful report for WP:WikiProject Stub sorting would be all stub categories (any category ending with the word "stubs") or stub templates (ending with the string "-stub") not linked to from Wikipedia:WikiProject Stub sorting/Stub types. A good heading would be:

Stub types not listed at Wikipedia:WikiProject Stub sorting/Stub types. These may be unproposed (in which case they should be listed at Wikipedia:WikiProject Stub sorting/Discoveries), or they may simply not have been listed by a user who created a proposed stub type, or renamed one under WP:Stub types for deletion.

עוד מישהו Od Mishehu 08:45, 19 August 2010 (UTC)[reply]

Shouldn't be too difficult to write. How often should it update? And I assume Wikipedia:Database reports/Unlisted stub types will work as an output location and title? --MZMcBride (talk) 03:59, 24 August 2010 (UTC)[reply]
This was actually one of the many reasons for creating {{Asbox}} - in that the List Comparer from WP:AWB can be used to compare Category:Stub message boxes with links from Wikipedia:WikiProject Stub sorting/Stub types. Unfortunately there are only about 3400 on the page and there are about 16000 in the category. It is necessary to pick up some subpages too, namely
  • Wikipedia:WikiProject_Stub_sorting/Stub_types/Geography
  • Wikipedia:WikiProject_Stub_sorting/Stub_types/Culture
  • Wikipedia:WikiProject_Architecture/Stub_categories
But even then the list is far short of the category size.
Xeno has also created Wikipedia:WikiProject Stub sorting/Complete template list, last updated March.
I will paste my results into Wikipedia:Database reports/Unlisted stub types and Wikipedia:Database reports/Uncategorized stub types, presently. Rich Farmbrough, 14:27, 31 August 2010 (UTC).[reply]
Most of these stub types will be listed at Wikipedia:WikiProject Stub sorting/Templates to vet. Rich Farmbrough, 15:11, 31 August 2010 (UTC).[reply]
Having removed those, there are still about 3000 left. I am adding them as an appendix to Wikipedia:WikiProject Stub sorting/Templates to vet. Rich Farmbrough, 19:32, 31 August 2010 (UTC).[reply]

I see the 2 most recent lists in WSS are from March and from August (5 months apart), so I think we need a more reliable system. About MZMcBride's list at Wikipedia:Database reports/Unlisted stub types - the title is good; one thing I forgot to consider is that the data is in the transcluded subpages, which means that it takes longer for the data to get to Wikipedia:WikiProject Stub sorting/Stub types - I'd say that we should see what's not linked to from the subpages of Wikipedia:WikiProject Stub sorting/Stub types. עוד מישהו Od Mishehu 08:06, 2 September 2010 (UTC)[reply]

Replacement for Wolterbot

From WP:VPT

Is it difficult for someone to replace the late Wolterbot with a bot that goes through the FAs and counts how many cleanup categories are also listed at the bottom, to update Wikipedia:Featured articles/Cleanup listing? I cannot imagine it to be a difficult task. YellowMonkey (new photo poll) 00:54, 23 August 2010 (UTC)[reply]

Maybe request a listing at WP:DBR? fetch·comms 02:36, 23 August 2010 (UTC)[reply]
Try asking at WP:BOTREQ Peachey88 (T · C) 07:22, 23 August 2010 (UTC)[reply]

The old bot was monthly, which is enough, really YellowMonkey (new photo poll) 04:12, 24 August 2010 (UTC)[reply]

I'm gonna let the WP:BOTREQ have first stab at this. If no one bites, I'll take a look. At a quick glance, it doesn't look like a bot was updating Wikipedia:Featured articles/Cleanup listing. And I'm not really sure what you define as a "cleanup category." Is there a master category somewhere of cleanup categories?
I hadn't realized B. Wolterding had stopped editing. :-/ --MZMcBride (talk) 04:18, 24 August 2010 (UTC)[reply]
Well, it was updated by a bot until March, and when the problems were dealt with, they were struck off manually in between the monthly runs. I guess if there is a parent cat, the usual ones needed are unsourced, dead links, tagged for POV/weasel, undue weight and all that. YellowMonkey (new photo poll) 04:42, 24 August 2010 (UTC)[reply]
I threw this together in about 5 minutes, what do you think? Tim1357 talk 20:35, 28 August 2010 (UTC)[reply]

Report request- transclusions of deleted templates

I would like to request that a DBR be created: Pages transcluding deleted or nonexistant templates, run Sunday and Wednesday. Would be nice to catch stuff that slips through TFD/H. Note: all pages, regardless of namespace, should be listed. —Train2104 (talk · contribs · count · email) 00:25, 26 August 2010 (UTC)[reply]

I just quickly peeked at this yesterday but I was having an issue mapping this out from my mind to the database, so I left it for another day. There are a few points I wanted to make while I'm here, though:
  1. Generally I favor "red-linked" instead of "deleted or nonexistant" [sic!]; this is mostly because a template can be red-linked due to never having been created or due to having been deleted (and this can be an important distinction in the context of some reports); the generic term covers both cases fairly well, in my view;
  2. There are some oddities in some (complex or "esoteric") templates that cause them to transclude non-existent templates; this is mostly due to poor coding from what I've seen, but it doesn't change the fact that there are glaring red links in reports like Templates transcluded on the most pages (configuration) currently (e.g., Template:Pubchemcite is listed as having over 4,000 transclusions but has never existed, as far as I can see);
  3. When you talk about "templates", the reality is that a template can exist in nearly any namespace (with a few exceptions); {{User:AbusiveAdmin/Racist userbox}} is technically a template just like {{!}}; I'm going to assume you're only talking about transclusions of pages in the Template namespace; if that's not the case, please let me know;
  4. [Related to my second point] I'm not sure you realize the volume of such a report; it will likely be thousands upon thousands of results, which leaves two options: (1) limit output to 1,000 results (truncate), or (2) paginate; I'm generally hesitant to paginate if it will be more than a few pages as I don't see that providing much value and it just adds needless clutter; Articles containing red-linked files (configuration) is a good example of this phenomenon.
I hope I haven't overwhelmed you; and I hope you've enjoyed the excessive semi-colons. --MZMcBride (talk) 02:58, 29 August 2010 (UTC)[reply]
I've found the use of Pubchemcite in {{Chembox PubChem}}. Probably should be a parameter but I haven't changed it yet. -- WOSlinker (talk) 10:16, 29 August 2010 (UTC)[reply]
If it is limited to 1000, then will it be sorted in a similar way to Templates transcluded on the most pages ? -- WOSlinker (talk) 10:22, 29 August 2010 (UTC)[reply]
It certainly could be. It'd just require pulling a COUNT(*) and sorting by it. --MZMcBride (talk) 18:00, 29 August 2010 (UTC)[reply]

I threw this together, but it seems to be pretty slow. It returned {{!comment}} as the first result. Tim1357 talk 22:24, 29 August 2010 (UTC)[reply]

The Query
-- Template-namespace transclusions whose target template page does not exist
SELECT tl_title
FROM   templatelinks 
       LEFT JOIN page 
         ON page_title = tl_title 
            AND page_namespace = 10 
WHERE  tl_namespace = 10 
       AND Isnull(page_id) 
LIMIT  1;

To cut down on the number of entries, let's put these restrictions on it:

  1. Transclusions of pages outside the template namespace wouldn't count.
  2. The template in question must have a deletion log entry. (This reduces the esoteric issue, and besides, who would want to use a never-existent template?)

Train2104 (talkcontribscount) 23:31, 29 August 2010 (UTC)[reply]

I'm running this now. We'll see how many results it outputs (and in how much time). --MZMcBride (talk) 01:21, 30 August 2010 (UTC)[reply]
With or without Train2104's restrictions? Tim1357 talk 02:04, 30 August 2010 (UTC)[reply]

(unindent) With. The query I used was:

SELECT
  tl_title,
  COUNT(*)
FROM templatelinks
LEFT JOIN page
ON tl_namespace = page_namespace
AND tl_title = page_title
JOIN logging_ts_alternative
ON tl_namespace = log_namespace
AND tl_title = log_title
AND log_type = 'delete'
WHERE page_id IS NULL
AND tl_namespace = 10
GROUP BY tl_title
ORDER BY COUNT(*) DESC;

It took fifteen minutes to run and output 6,164 rows. A sample of the top:

Extended content
mzmcbride@bert:queries$ head -50 deleted-templates.txt
tl_title        COUNT(*)
WPBannerMeta/collapsed  381250
Spoiler 7470
Pending_deletion        4788
WPStatistics/class      1338
WPMeasure/class 930
Endspoiler      704
Highrfc-loop    618
Spoilers        505
IPAEng  390
User_Christian  333
Badbio  285
Emot    217
Grin    150
Chooserrwelcome 148
Oppose  144
CfD_doc 139
Tablabonita     138
SampleWikiProject       126
User_Chinese_reunification      126
Support 121
User_Anarcho-capitalist 121
User_admins_ignoring_policy     120
User_Christian_democrat 112
User_evol-2     112
User_en-6       109
Magazinecover   105
User_liberty    104
ISO_639_name_ar-Latn    104
Deenoewelcome   103
User_Two-state  92
Smiley  90
User_pro_concealed_carry        90
CfR_doc 89
Solarwelcome    88
User_Capitalist 84
Pokémon_species 84
User_Pro-choice 81
User_anti-gun   76
ISO_639_name_fil        76
Template_doc_page_transcluded   73
User_against_Saud       72
Kfwelcome       72
RfD_doc 71
Keep    70
Future_game     70
User_pope       70
Wpd     69
H:f     68
User_Sock_Puppet        68

And some results from the bottom of the file:

Extended content
mzmcbride@bert:queries$ tail -50 deleted-templates.txt
Beerbox_end     1
Cvg-software-stub       1
User_plays_runescape    1
List_of_people_Ba_Links 1
Placebox-location       1
Taxobox-2003_Invasion_of_Iraq   1
Actor   1
Country_alias_United_Kingdom    1
G0r     1
WikiProject_Banner_That_'70s_Show       1
Sortdate        1
Centrist_Party  1
F.C._Halifax_Town_squad 1
VfD-Disassociate        1
Infobox_City-NoFlag/Nickname    1
Micro-stub      1
User_NMDP       1
Gymnast_ru_image        1
List_of_people_Hu_Links 1
Political_party_w_logo  1
Alexwcovington/test/comment     1
Country_data_Brazil-imperial    1
G1k     1
Wikibookspage   1
Jadransport     1
NoDescribeEpisode       1
Spname  1
Chicago_portal  1
VfD-List_of_unusual_English_words       1
Usemlh. 1
Db-i9/new       1
User_smartass   1
HUNf1949        1
List_of_people_St_Links 1
Country_data_Kirovohrad_Oblast  1
User_googlechrome       1
G2i     1
Workers_Revolutionary_Party/meta/color  1
JoinWPSims      1
Note_N  1
SqlPac/WPDATABASE       1
1996_Leaders_of_the_Asia_Pacific_Economic_Co-operation_Leaders’_Forum   1
User_Chaplin    1
Cite_journal3   1
User_Warm       1
Fb_cl-non_grp_header    1
VfD-Social_surplus      1
Infobox_LEGO_Club       1
SCOTUS_1967-1969        1
User:Encyclopedist/Welcome!     1

This file is obviously sorted by number of transclusions. Train2104: Is this what you want? --MZMcBride (talk) 02:15, 30 August 2010 (UTC)[reply]

I took a look at {{Pending deletion}}, and what links here only gave me 1 transclusion (which I have removed). Where are the other 4787 transclusions? Also, the last entry is in the userspace.— Train2104 (talkcontribscount) 12:43, 30 August 2010 (UTC)[reply]
Some of the data is just bad. The templatelinks table is referencing non-existent page IDs. I imagine some old bug didn't delete the rows in the templatelinks table appropriately when a page was deleted via the user interface. That seems like a reasonable explanation for the high count for a template like Template:Pending deletion. It may be possible to account for these bad rows in future queries.
Most of the other data looks fine. All of the pages are in the Template namespace. For example: Special:WhatLinksHere/Template:User:Encyclopedist/Welcome! and Special:WhatLinksHere/Template:User_Sock_Puppet. --MZMcBride (talk) 14:33, 30 August 2010 (UTC)[reply]
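One rough way to test the stale-rows hypothesis above (an untested sketch against enwiki_p; it looks for templatelinks rows whose source page no longer exists) would be:

-- templatelinks rows whose source page is gone; these inflate transclusion counts
SELECT tl_title, COUNT(*) AS stale_rows
FROM templatelinks
LEFT JOIN page ON page_id = tl_from
WHERE tl_namespace = 10
  AND page_id IS NULL
GROUP BY tl_title
ORDER BY COUNT(*) DESC
LIMIT 20;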

Transclusions outside template namespace should be fairly few, excepting userboxen, and there's no reason they shouldn't be cleaned up (redirect to {{Tl:User box deleted}} maybe?). It would reduce the report size if you could ignore transclusions from TfD. Rich Farmbrough, 15:20, 31 August 2010 (UTC).[reply]

Looks fine to me. Paginate the first 4000 rows, truncate the rest until some are dealt with. — Train2104 (talkcontribscount) 02:26, 11 September 2010 (UTC)[reply]
The data should appear at Transclusions of deleted templates (configuration) in a few minutes. --MZMcBride (talk) 02:42, 11 September 2010 (UTC)[reply]

Medical articles frequently edited by non-autoconfirmed users

Hello. At Wikipedia_talk:Invitation_to_edit we're looking for the medical articles that are most frequently edited by non-autoconfirmed users (to trial a "how to edit a medical article" tutorial). Would it be possible to generate such a list? The top 40 should be sufficient. Anthony (talk) 17:57, 29 August 2010 (UTC)[reply]

Well, this is fairly complicated:
  1. What constitutes a "medical article"? Pages whose talk pages are tagged a particular way? Pages in a particular category? Pages from a particular list?
  2. Auto-confirmed status is implicit, not explicit, and it has changed over time. It used to be only four days; now it's four days + 10 edits. You could roughly approximate what the auto-confirmed standards of today are when looking at the database (at this specific moment in time), but you can't be exact. That is, you can say "if these people tried to edit today, we could say with a high degree of confidence that they wouldn't be able to on a semi-protected article due to their current edit count and registration date." But when you look at page histories, the stored edit count will remain current and the registration date will remain constant. So when you evaluate older edits, you won't know if the user was auto-confirmable then without manually calculating it for each edit. Other odd extensions like TorBlock completely change the auto-confirmed settings under certain conditions.
If you wanted to limit the report to something more easily definable (like all pages in Category:X sorted by number of IP edits), that would be much simpler. --MZMcBride (talk) 18:07, 29 August 2010 (UTC)[reply]
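For what it's worth, the rough approximation of today's standard (4 days + 10 edits) described above would look something like the sketch below; note that it uses the current values in the user table, so it only guesses at the status a user had at the time of any particular edit.

-- Users who would probably not be auto-confirmed if they edited today.
-- user_registration is NULL for very old accounts, which are effectively confirmed.
SELECT user_name, user_editcount, user_registration
FROM user
WHERE user_editcount < 10
   OR user_registration > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 4 DAY), '%Y%m%d%H%i%s')
LIMIT 20;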

Thanks for the quick feedback! I've asked at the medicine project if there is a comprehensive list of medical articles. And IP edits is a good second best. I'll get back to you. Anthony (talk) 18:25, 29 August 2010 (UTC)[reply]

I'm back! Would it be possible to find the 100 articles with the greatest number of unique IP edits in the last year from the C, B, GA and FA class med articles here? Anthony (talk) 19:19, 29 August 2010 (UTC)[reply]

The data is at http://en.wikipedia.org/w/index.php?oldid=381806837. A few notes:

  1. it's unsorted;
  2. if there were 0 unique IP edits in the past year, the row was omitted (e.g., Timeline of tuberous sclerosis in Category:FL-Class medicine articles);
  3. the results are not truncated.

Hope that helps. The source of the script I used is below. --MZMcBride (talk) 02:18, 30 August 2010 (UTC)[reply]

Thank you so much. That was perfect. I have reduced the list to the top 20. --Anthony (talk) 09:32, 30 August 2010 (UTC)[reply]
Extended content
#! /usr/bin/env python

import MySQLdb

categories = ['FA-Class_medicine_articles', 'FL-Class_medicine_articles',
              'GA-Class_medicine_articles', 'B-Class_medicine_articles',
              'C-Class_medicine_articles']

conn = MySQLdb.connect(host='sql-s1', db='enwiki_p', read_default_file='~/.my.cnf')
cursor1 = conn.cursor()  # outer cursor: pages in each assessment category
cursor2 = conn.cursor()  # inner cursor: unique IP editors per page over the past year

for category in categories:
    cursor1.execute('''SELECT
                         page_title
                       FROM page
                       JOIN categorylinks
                       ON cl_from = page_id
                       WHERE page_namespace = 1
                       AND cl_to = %s;''' , category)
    print '\n== [[:Category:%s]] ==' % category
    print '{| class="wikitable sortable"\n! Page\n! Unique IP edits\n|-'
    for row in cursor1.fetchall():
        cursor2.execute('''SELECT
                             page_title,
                             COUNT(DISTINCT rev_user_text)
                           FROM page
                           JOIN revision
                           ON page_id = rev_page
                           WHERE page_namespace = 0
                           AND page_title = %s
                           AND rev_user = 0
                           AND rev_timestamp > DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 1 YEAR),'%%Y%%m%%d%%H%%i%%s')
                           GROUP BY rev_page;''' , row[0])
        for result in cursor2.fetchall():
             print '| [[' + result[0] + ']]\n' + '| ' + str(result[1]) + '\n|-'
    print '|}'

cursor1.close()
cursor2.close()
conn.close()

Non-free images with multiple files in the history?

Hey, would it be possible to get a report of all the files tagged as non-free which have multiple files in the history? For an example of what I mean, see File:Manchester Mark2.jpg- the image was reduced, but the old version is still visible (and will be until deleted). The old versions fail the NFCC (as they are not in use) and so some cleanup is required in every case. J Milburn (talk) 00:02, 19 September 2010 (UTC)[reply]

The first 1000 are at User:Svick/Non-free images with multiple files in the history. I can run that report on demand or regularly or extend it to more articles, if you wanted. Svick (talk) 15:25, 19 September 2010 (UTC)[reply]
Thank you very much, I'm requesting permission for my bot to deal with them, so a longer list may be very helpful- I'll see how the request for approval goes. J Milburn (talk) 22:03, 19 September 2010 (UTC)[reply]

I'd like to up the frequency of this report from monthly to weekly. Doable? --Stepheng3 (talk) 21:03, 20 September 2010 (UTC)[reply]

Sure, done. --MZMcBride (talk) 21:52, 20 September 2010 (UTC)[reply]
Thank you! --Stepheng3 (talk) 23:00, 20 September 2010 (UTC)[reply]
When can I expect another update? --Stepheng3 (talk) 16:13, 28 September 2010 (UTC)[reply]
Well, it says "0 12 * * 4" on the configuration page. That'd be 0 minutes past hour 12 on the fourth day of the week. So... maybe on Thursday? I'm not really sure why it's been over a week since it updated. I'll try to remember to peek at this later today. --MZMcBride (talk) 16:23, 28 September 2010 (UTC)[reply]
I poked at this a bit. I verified that the crontab entry is "0 12 * * 4 python ~/scripts/database-reports/coordoverlap.py > /dev/null" and that the script works properly. (I just ran it manually, so the report is now updated.) I think the report should update again this week. Please let me know if you continue to have issues. It's comparing two large sets ({{Coord/display/inline,title}} has 202,431 transclusions while {{Coord/display/title}} has 349,287 transclusions), but it shouldn't take more than a few minutes to execute and finish. I don't have any e-mail from cron indicating an issue from the script, so I guess we'll just have to wait and see if this continues to be a problem. --MZMcBride (talk) 02:39, 29 September 2010 (UTC)[reply]
Thanks. I'll let you know if it does. --Stepheng3 (talk) 03:13, 29 September 2010 (UTC)[reply]

More to-do

At some point...

--MZMcBride (talk) 22:15, 29 September 2010 (UTC)[reply]

What's a leecher report? Svick (talk) 23:13, 29 September 2010 (UTC)[reply]
"User pages of non-contributing users" or something like that. That is, users who come here, register an account, create a user page (that's often spammy, vandalism, or a test page), and never edit anywhere else but on their user page. There are some added complications now (a lot of people land here from other Wikimedia wikis, especially with unified login now, e.g.) and there still isn't a very clear title, but I think the overall idea has merit. --MZMcBride (talk) 00:30, 30 September 2010 (UTC)[reply]
There is some merit in that sort of report, maybe not real often though. --Rocksanddirt (talk) 20:32, 22 October 2010 (UTC)[reply]

Request for one-time run

Can we run this report, one time only? I don't need it listed on the table.

  • All files that start with "BSicon " or any case variant of it, and
  • that are not on Commons, or
  • that obscure another file with the same name on Commons?

It would help me find some bad icons, since all of them should be on commons. WP:RDT is a global project. — Train2104 (talk • contribs • count) 01:03, 6 October 2010 (UTC)[reply]

I'm not sure why you need a manual query for this. Special:PrefixIndex/File:BSicon and commons:Special:PrefixIndex/File:BSicon should do the trick (perhaps with a few set operators). --MZMcBride (talk) 01:14, 6 October 2010 (UTC)[reply]
Here's the list you want, I think: http://en.wikipedia.org/w/index.php?oldid=389000355 --MZMcBride (talk) 01:17, 6 October 2010 (UTC)[reply]
And the inverse: http://en.wikipedia.org/w/index.php?oldid=389000621 --MZMcBride (talk) 01:20, 6 October 2010 (UTC)[reply]

Most-linked redirects (2)

A biweekly (or weekly if it seems useful) report I'd like to see is a list of most-linked redirects. For example, the bot would count the number of times John Blain (football player) is linked, showing the top 500 or so redirects. While some would be useful and should be left alone, others could be found that were the result of page moves and should have been fixed a long time ago. Wizardman Operation Big Bear 21:23, 9 October 2010 (UTC)[reply]

Why should redirects that are the result of page moves be fixed? See WP:NOTBROKEN. Svick (talk) 12:38, 10 October 2010 (UTC)[reply]
Some don't need to be, but linking to an outdated title could cause newly created articles to follow that nonstandard format, leading to more work for the rest of us. And while some redirects don't need to be cleaned up, many do. Wizardman Operation Big Bear 04:15, 11 October 2010 (UTC)[reply]
Huh, you've actually asked about this before. And I've already given you the pagelinks table lecture, so now all I need to do is remember to actually write this report. :-) I'll try to get to it tomorrow. For what it's worth, I doubt it will be a weekly report. It'll likely be kind of "expensive" to generate/update. --MZMcBride (talk) 07:21, 11 October 2010 (UTC)[reply]
There is already a report called Templates containing links to disambiguation pages. Perhaps a Templates containing links to redirect pages may be useful as well? -- WOSlinker (talk) 08:49, 11 October 2010 (UTC)[reply]
To Wos: that could work. To MZM: if I did ask earlier, then I forgot about it. I understand that it would definitely be a tough list to create, and limiting it to monthly given the update expense is fine. Wizardman Operation Big Bear 18:38, 11 October 2010 (UTC)[reply]
I just noticed my request several sections up. To clarify, this would just be for mainspace articles; no reason to bother with others. If you want to trim it more and remove other redirects which shouldn't be touched, then we can remove any that include the category "redirects with possibilities" which should knock time off running it. Wizardman Operation Big Bear 18:42, 11 October 2010 (UTC)[reply]

Take a look at Most-linked redirects (configuration). The initial limit was 30 incoming links. I've now raised it to 200 incoming links and it's running currently. We'll see what it outputs, how useful it is, and what (if any) adjustments are needed once it finishes.

The idea of creating another report for "Templates containing links to redirects" seems to be a good one. --MZMcBride (talk) 23:57, 11 October 2010 (UTC)[reply]
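For reference, the core of such a report is roughly the sketch below (untested, and not necessarily the query the report actually uses; as noted, counting links this way is expensive). The 200-link threshold matches the current configuration mentioned above.

-- Article-namespace redirects ranked by incoming links from other articles.
SELECT target.page_title, COUNT(*) AS incoming_links
FROM pagelinks
JOIN page AS target ON target.page_namespace = pl_namespace
                   AND target.page_title = pl_title
JOIN page AS source ON source.page_id = pl_from
WHERE pl_namespace = 0
  AND target.page_is_redirect = 1
  AND source.page_namespace = 0
GROUP BY target.page_id
HAVING COUNT(*) >= 200
ORDER BY COUNT(*) DESC;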

Thanks. Admittedly I was not expecting that many redirects to have such a large number of links. So far it's helped find a few issues that I've handled, even though most redirects there would be left alone. Wizardman Operation Big Bear 02:57, 12 October 2010 (UTC)[reply]
It might be helpful to see if there are ways to refine the scope of the report. This can usually be done by looking at specific examples to see if there are distinguishable characteristics about the page that can be broadly applied to other similar pages. For the example you provided (John Blain (football player)), we can say a few things:
  • the page has a move log
  • the page's target is a living person
  • the page title contains parentheses (which usually signify disambiguation of some kind)
  • the page has 7 incoming links from other articles
  • the page title signifies that the person is a living person (using a common first name)
Some of these observations are completely useless. Some aren't. If you can try to give me a better idea of what kind of pages you're looking for (perhaps with some other specific titles), I can try to make a better report for you. This might mean that the report will ultimately be moved to a different title, but that's no big deal. --MZMcBride (talk) 03:03, 12 October 2010 (UTC)[reply]
Out of those, the main one that jumps out at me would be the move log; that is something I was hoping to work on, alongside the parentheses modifier as well. The many football player disambigs, for example, stem from a pretty big move war from 07-08, so there's a lot of articles that go to the wrong places still. Wizardman Operation Big Bear 03:20, 12 October 2010 (UTC)[reply]

On this version of the report, there is an entry for Head of the River (Victoria), with the entry noting 8 files in use. Yesterday, I removed [1] more than 400 uses of non-free files from this page. The report is showing the number of files, not the number of uses. This particular article has gone largely unnoticed for three years because "8" uses is not a high number. Yet this article is certainly one of the highest, if not the highest, abusers of non-free content we've ever had on this project.

I'd like to see this report modified, or a new report created, to show how many non-free file uses there are, rather than the number of non-free files used. --Hammersoft (talk) 14:18, 20 October 2010 (UTC)[reply]

The database only tracks binary usage data. This is also true of templates, categories, page links, and external links. It might be possible to add a column to this report, but it will not be possible to get number of uses for all pages. That might result in some false negatives: if a page is using only one non-free file, but uses it a few hundred times, it won't be listed.
In order to count uses, a regex will have to be run on the page text, I think. This will need to account for templates and other such things. I have an idea of how to do this in my head, but it might be a while before I get around to it. --MZMcBride (talk) 18:52, 20 October 2010 (UTC)[reply]

Please exclude, from Wikipedia:Database reports/Stubs included directly in stub categories, any article transcluding Project:Contributor copyright investigations/Darius Dhlomo/Notice. These pages aren't the problem of WikiProject Stub Sorting; and when their problem is solved, these pages will either have been deleted or reverted to a version which probably doesn't have direct use of stub categories. עוד מישהו Od Mishehu 08:10, 21 October 2010 (UTC)[reply]

This should be updating now. --MZMcBride (talk) 19:05, 24 October 2010 (UTC)[reply]
Done, thank you. עוד מישהו Od Mishehu 08:57, 25 October 2010 (UTC)[reply]

Popular WikiProjects

User:WhatamIdoing has expressed a desire for data regarding popular WikiProjects, and those data interest me also. User talk:Mr.Z-man/Archive 12#Popular WikiProjects? has a record of the request. I would like to see a monthly report (of page views or edits or watchers; preferably one report for each of those three variables). I have added this talk page to my watchlist, and I will watch for a reply (or replies) here.
Wavelength (talk) 19:25, 27 October 2010 (UTC)[reply]

Reports of WikiProject watchers and WikiProjects by changes are now up. Svick (talk) 17:30, 13 November 2010 (UTC)[reply]
Thank you very much.—Wavelength (talk) 20:52, 13 November 2010 (UTC)[reply]

New BLPs that might be eligible for a sticky prod

Earlier this year the community introduced a new form of deletion for new unreferenced BLPs, there's a concern that we aren't tagging as many as we could. Please could we have the following regular report:

Criteria: articles in Category:All unreferenced BLPs that were created after March the 18th 2010
Format: List of articles, marked if they are already in Category:BLP articles proposed for deletion or Category:Articles for deletion
Frequency: Weekly

Many thanks ϢereSpielChequers 17:41, 28 October 2010 (UTC)[reply]

Done: Biographies of living people possibly eligible for deletion (configuration). --MZMcBride (talk) 05:48, 5 November 2010 (UTC)[reply]

Red-linked categories with significant incoming links

In light of this discussion, would it be possible to generate a report listing red-linked categories (i.e., deleted or never created) which have significant incoming links (i.e., from any namespace except User: and Wikipedia:)? Thank you, -- Black Falcon (talk) 17:31, 14 November 2010 (UTC)[reply]

  • I think Special:Wantedcategories generally updates regularly, which is why this report was never created, although that report only lists the thousand most populated categories. If someone were to go through and either bluelink or de-link the first couple hundred categories depending on their specifics, however, you would likely get everything with 2 category members or more. VegaDark (talk) 01:35, 15 November 2010 (UTC)[reply]
    • Special:WantedCategories only lists red-linked categories with actual members, doesn't it? Does it also list empty categories which have incoming links? -- Black Falcon (talk) 05:38, 15 November 2010 (UTC)[reply]
      • Ah, incoming links, for some reason I was thinking category members. Nope, guess there isn't a report for that. Although I bet there will be a ton of redlinked crap categories we've used in examples in deletion discussions over the years. VegaDark (talk) 10:06, 15 November 2010 (UTC)[reply]
        • I hadn't thought of those, but you're right: there will be a lot of those, as well as links from CfD nominations and notifications to users. I don't know if it is possible to generate a report that would ignore incoming links from the User: and Wikipedia: namespaces but, if it is possible, it would help to focus the list. -- Black Falcon (talk) 17:06, 15 November 2010 (UTC)[reply]

List of longest 'DISCUSSION' content.

Hi, is it possible to get a list of (Wikipedia) articles which have the largest 'Discussion' content? On these 'Talk:' pages you can also see the number of archives; for example, the article about Barack Obama has 71 archives and will probably end up high in the list. Thanks in advance. Kyra —Preceding unsigned comment added by 82.171.75.225 (talk) 10:55, 17 November 2010 (UTC)[reply]

Here you go. Svick (talk) 20:40, 20 November 2010 (UTC)[reply]
Looks pretty good. Might be possible to limit to pages with "rchive" in them and then add the root page for a more precise count of actual discussion, but your method is probably about as accurate, relatively.
One thing. Perhaps it might be better to put the column in megabytes? For example, 40 230 kilobytes = 39.2871094 megabytes (according to Google); I think the megabytes figure is much easier to understand and appreciate.
Thanks for taking care of this. :-) --MZMcBride (talk) 20:59, 20 November 2010 (UTC)[reply]
I have changed it to show the size in megabytes.
I don't like including only archives the way you suggested, because it would mean the number was the sum of the sizes of the talk page and its archives, excluding other subpages but including their archives (e.g. Wikipedia talk:Requests for arbitration/Badlydrawnjeff/Proposed decision/Archive 1). And some people are already confused about what exactly the number means, so I don't want to complicate it more. Svick (talk) 23:05, 20 November 2010 (UTC)[reply]
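For anyone curious about the mechanics, a sketch of that kind of per-page sum (untested, and not necessarily how the report is actually computed) groups every talk page and its subpages, archives included, under the root title:

-- Total size in megabytes of each talk page plus all of its subpages.
SELECT SUBSTRING_INDEX(page_title, '/', 1) AS root_page,
       ROUND(SUM(page_len) / 1048576, 1) AS size_mb
FROM page
WHERE page_namespace = 1
GROUP BY SUBSTRING_INDEX(page_title, '/', 1)
ORDER BY SUM(page_len) DESC
LIMIT 100;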

I've taken this to MfD, as it's quickly been misinterpreted. There's nothing problematic about having long talk histories so long as the archives are orderly, so it's not clear what purpose this particular report fulfils. Chris Cunningham (user:thumperward: not at work) - talk 13:01, 21 November 2010 (UTC)[reply]

Two database reports for File redirects, please

Originally I had posted here on MZMcBride's talkpage but he has brought my request to this discussion board instead. Could someone construct two database reports for redirects in the File namespace with one or fewer incoming links, one for those that match Titleblacklist and another for those that don't? :| TelCoNaSpVe :| 05:23, 24 November 2010 (UTC)[reply]

Note: The Titleblacklist is a pain since matching can only be correctly done with PHP (or maybe perl). — Dispenser 05:34, 24 November 2010 (UTC)[reply]
Okay, let's ignore Titleblacklist then. Can we at least have a database report of all redirects from the file namespace with one or fewer incoming links stored somewhere (e.g. here)? :| TelCoNaSpVe :| 22:00, 7 December 2010 (UTC)[reply]
5803 files. That's quite a long report; I think it would be worth trimming it down somehow, like reporting only those that have no incoming links at all? Svick (talk) 16:38, 16 January 2011 (UTC)[reply]
You're allowed to paginate. Plenty of current reports do. :P
That said, I'm not sure how good an idea this report is. What's the point of finding (internally) unused file redirects? --MZMcBride (talk) 00:54, 17 January 2011 (UTC)[reply]

The Talk pages by size MfD

The Talk pages by size MfD raised a lot of good issues and ideas. A main problem is that the database reports have very little information about them on the page, which has led to confusion (as brought out in the MfD). Most of the database reports pages are not categorized. In response to the MfD, I modified the Wikipedia:Database reports/Talk pages by size page text using posts from the MfD. I also created Category:Wikipedia database reports as a place to categorize all the pages listed at Wikipedia:Database reports and other database reports pages wherever they may be. Initially, I thought of using {{infopage}} on the top of the database pages, but that template didn't fit because the database report pages do not describe communal consensus on some aspect of Wikipedia norms and practices. I added a generic one to Talk pages by size instead. In view of the talk page tagging with the admonishment "Perhaps this will motivate greater efficiency in the use of kilobytes," you may want to create a top template to distinguish a utility report from a statistical report. I added text to the top of Wikipedia:Database reports to address this. You may further want to expand the text of Wikipedia:Database reports to provide some Database reports consensus control over the use of utility reports and statistical reports. -- Uzma Gamal (talk) 16:54, 25 November 2010 (UTC)[reply]

I populated Category:Wikipedia database reports with some of the database reports pages. If you like the category, please add the rest of the database reports pages. If not, you know what to do and there is no need to inform me of it : ). -- Uzma Gamal (talk) 17:08, 25 November 2010 (UTC)[reply]
The MFD was a complete waste of time and offered (almost) nothing of value, except to point out once again that some people posting here (on Wikipedia) are clueless. The category additions (and other additions) are going to be overwritten the next time the reports update. A better approach to the intro text of the reports is needed, but there are a lot of other projects I'd rather work on. --MZMcBride (talk) 22:49, 25 November 2010 (UTC)[reply]

Rename Long pages

Wikipedia:Database reports/Long pages should be renamed Wikipedia:Database reports/Talk pages by length consistent with Wikipedia:Database reports/Talk pages by size. "Long" is a subjective conclusion whereas Talk pages by length offers no opinion to characterize the page. -- Uzma Gamal (talk) 16:54, 25 November 2010 (UTC)[reply]

Long pages (configuration) has more than talk pages. --MZMcBride (talk) 22:46, 25 November 2010 (UTC)[reply]
"Pages by length" might work. It's up to you. Keep up the good work. -- Uzma Gamal (talk) 14:04, 26 November 2010 (UTC)[reply]

Orphaned template documentation

Would it be possible to generate a report of "/doc" subpages in the Template namespace for which the main template page does not exist or is a redirect? Thanks, -- Black Falcon (talk) 05:13, 29 November 2010 (UTC)[reply]

May even be useful to show all subpages and not just /doc. -- WOSlinker (talk) 07:59, 29 November 2010 (UTC)[reply]
I've tried it for other types of subpages; it isn't useful. There are navboxes with titles like 1984/85_X_game that are still transcluded, and <noinclude> used to not work with the preload function, so those templates were left uncategorized. We could improve Wikipedia:Database reports/Unused templates so that it's more useful... — Dispenser 18:31, 5 December 2010 (UTC)[reply]
/* Orphaned doc pages */
SELECT CONCAT("[[Template:", doc.page_title, "]]"),
   IF(rd_title IS NOT NULL, IF(trg.page_id, "Occupied", ""), "Delete") AS "Move",
   (SELECT COUNT(*) FROM templatelinks WHERE tl_namespace=doc.page_namespace AND tl_title=doc.page_title) AS "Transclusions",
   rd_title
FROM page AS doc
LEFT JOIN page AS tpl ON tpl.page_namespace=doc.page_namespace
    AND tpl.page_title=TRIM(TRAILING "/doc" FROM doc.page_title)
LEFT JOIN redirect ON rd_from=tpl.page_id
LEFT JOIN page AS trg ON trg.page_namespace=rd_namespace
    AND trg.page_title=CONCAT(rd_title, "/doc")

WHERE doc.page_namespace=10
AND   doc.page_title LIKE "%/doc"
AND   doc.page_is_redirect=0
AND (tpl.page_is_redirect=1 OR tpl.page_id IS NULL)

And we have 209 results with many which need to be merged/redirected to the new /doc page. — Dispenser 18:31, 5 December 2010 (UTC)[reply]

Long pages changes

The page has two edits ([2], [3]) in the history that probably have to be implemented in the bot. At the same time, the 140K threshold could also be lowered to 130K. --Kslotte (talk) 11:14, 5 December 2010 (UTC)[reply]

Proposed change

I was wondering if Wikipedia:Database reports/Unused templates could possibly be changed so that templates which transclude {{substituted}} within them or within their /doc are either not included, or are separated into a separate section entitled "Substituted Templates" or something similar? This would make it much easier to see which templates intentionally have no transclusions (because they are supposed to be substituted) and those which are just unused. Thanks! Mhiji (talk) 18:23, 12 December 2010 (UTC)[reply]

Tag the templates with {{transclusionless}}; I'll write some code to exclude pages tagged with that template. --MZMcBride (talk) 18:52, 12 December 2010 (UTC)[reply]
That's great, thanks. Though {{transclusionless}} redirects to {{substituted}} anyway? And {{substituted}} is used much more than its redirect. Mhiji (talk) 23:38, 12 December 2010 (UTC)[reply]
Actually, we can use categories as the bug regarding <noinclude> has been fixed. — Dispenser 02:00, 13 December 2010 (UTC)[reply]
Templates are more flexible than categories. They allow content to be included or not, they allow for easier redirects and renames, etc. Use a template. It doesn't matter if you use {{transclusionless}} or {{substituted}}. Either will work. --MZMcBride (talk) 02:15, 13 December 2010 (UTC)[reply]

Request: List of articles found in every language but English

I started an article recently and then found there were already 20+ other versions on other language Wikipedias. This got me wondering what other articles exist in multiple other languages but don't appear in English. Has anyone tried to make a list of topics which have not been translated into English, sorted by the number of other language Wikipedias they appear in? Or is anyone willing to try making such a list? Thanks in advance. —Pengo 01:41, 25 December 2010 (UTC) (Crossposted from Wikipedia:Village_pump_(idea_lab)#Every_language_but_English.3F). And if you need more specifics, I'd be happy to flesh out the idea.[reply]

I worked on this for 2 hours and I don't have much to show for it. There are a lot of Wikipedias (269), some of which are really small (Cheyenne Wikipedia has 57 articles). Also, on the toolserver, the Wikipedia databases are served across 3 separate servers, which makes querying difficult. If you (MZMcBride) want to see my work, I've included the query below. Good luck! --05:04, 28 December 2010 (UTC)
Tim1357's broken query
SELECT ff.page_title, count(ff.page_title)
FROM   (SELECT z.page_title 
        FROM   dewiki_p.page AS z 
        WHERE  z.page_namespace = 0 
        UNION ALL 
        SELECT y.page_title 
        FROM   frwiki_p.page AS y 
        WHERE  y.page_namespace = 0 
        UNION ALL 
        SELECT x.page_title 
        FROM   plwiki_p.page AS x 
        WHERE  x.page_namespace = 0 
        UNION ALL 
        SELECT w.page_title 
        FROM   itwiki_p.page AS w 
        WHERE  w.page_namespace = 0 
        UNION ALL 
        SELECT v.page_title 
        FROM   jawiki_p.page AS v 
        WHERE  v.page_namespace = 0 
        UNION ALL 
        SELECT u.page_title 
        FROM   eswiki_p.page AS u 
        WHERE  u.page_namespace = 0 
        UNION ALL 
        SELECT t.page_title 
        FROM   ptwiki_p.page AS t 
        WHERE  t.page_namespace = 0 
        UNION ALL 
        SELECT s.page_title 
        FROM   nlwiki_p.page AS s 
        WHERE  s.page_namespace = 0 
        UNION ALL 
        SELECT r.page_title 
        FROM   ruwiki_p.page AS r 
        WHERE  r.page_namespace = 0 
        UNION ALL 
        SELECT q.page_title 
        FROM   svwiki_p.page AS q 
        WHERE  q.page_namespace = 0 
        UNION ALL 
        SELECT p.page_title 
        FROM   zhwiki_p.page AS p 
        WHERE  p.page_namespace = 0 
        UNION ALL 
        SELECT o.page_title 
        FROM   cawiki_p.page AS o 
        WHERE  o.page_namespace = 0 
        UNION ALL 
        SELECT n.page_title 
        FROM   nowiki_p.page AS n 
        WHERE  n.page_namespace = 0 
        UNION ALL 
        SELECT m.page_title 
        FROM   fiwiki_p.page AS m 
        WHERE  m.page_namespace = 0 
        UNION ALL 
        SELECT l.page_title 
        FROM   ukwiki_p.page AS l 
        WHERE  l.page_namespace = 0) AS ff 
WHERE  NOT EXISTS (SELECT 1 
                   FROM   enwiki_p.page AS aa 
                   WHERE  aa.page_namespace = 0 
                          AND aa.page_title = ff.page_title) 
GROUP BY ff.page_title
LIMIT  1;
Err, aren't you assuming that each wiki would have the article with the same title? Surely other language Wikipedias will have localized titles like "World War II".... You need to focus on the langlinks table. --MZMcBride (talk) 05:05, 28 December 2010 (UTC)[reply]
I had already posted the query at the village pump and concluded that it would not work since 1) the interwikis aren't kept in good condition and 2) other languages use different page structures. On DE, Quantum physics and Quantum mechanics are separate articles, while on EN they are covered in the same article. — Dispenser 05:45, 28 December 2010 (UTC)[reply]
I thought of approaching this as "biggest interwiki networks that don't include enwiki". But I think that would be difficult or impossible to do with just SQL. Svick (talk) 11:12, 28 December 2010 (UTC)[reply]
It has been done in the past, I think. [4] [5] [6]. Sole Soul (talk) 13:43, 28 December 2010 (UTC)[reply]
I created Spanish pond turtle yesterday for just that reason (it has 7 other language pages). I think you need to look at the bigger other-language Wikipedias (.es, .de) and check whether an article has no .en link but has many to others. Regards, SunCreator (talk) 14:36, 28 December 2010 (UTC)[reply]
Re interwiki links not being in good condition: my experience with The death anomalies project has been quite the opposite - interwiki link errors do exist, but are rare. I think this sort of report would generate a useful list of articles that we potentially should have, and where instead it identifies that we are missing an interwiki link, then that is also useful info. So yes, it would be worth doing this by checking for articles on DE, FR, etc. that have large numbers of interwiki links but not an EN one. Whether the result is an extra article or an added interwiki link to an existing article doesn't matter as both are positive outcomes. ϢereSpielChequers 17:25, 28 December 2010 (UTC)[reply]
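Along the lines suggested above, a sketch of the langlinks-based approach run against a single large wiki might look like the following (untested; dewiki is used purely as an example, and the toolserver database names are assumed). A hit can mean either a missing English article or a missing interwiki link, and as discussed both outcomes are useful to know about.

-- German articles with many interlanguage links but no link to the English Wikipedia.
SELECT p.page_title, COUNT(*) AS language_links
FROM dewiki_p.page AS p
JOIN dewiki_p.langlinks AS ll ON ll.ll_from = p.page_id
WHERE p.page_namespace = 0
  AND p.page_is_redirect = 0
  AND NOT EXISTS (SELECT 1
                  FROM dewiki_p.langlinks AS en
                  WHERE en.ll_from = p.page_id
                    AND en.ll_lang = 'en')
GROUP BY p.page_id
ORDER BY COUNT(*) DESC
LIMIT 100;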


Images with no recognisable license

Would it be possible to have a 'report' based version of http://en.wikipedia.org/wiki/User:MGA73/No_license which updated daily?

Sfan00 IMG (talk) 13:52, 5 January 2011 (UTC)[reply]

I don't know what you mean. What's deficient with that subpage? How would a separate report be different? --MZMcBride (talk) 01:59, 16 January 2011 (UTC)[reply]

Images not yet suitable for Commons

Is it possible to have a report that searches for media tagged {{mtc}} (and related) but which don't have a recognisable license, or one that's incompatible with Commons (such as fair-use)?

Report to run on a daily or weekly basis depending on the volume of media concerned?

Sfan00 IMG (talk) 13:56, 5 January 2011 (UTC)[reply]

I don't know what "recognisable license" means. I also don't know which licenses are incompatible with Commons. Is there a category for such a thing? --MZMcBride (talk) 02:00, 16 January 2011 (UTC)[reply]

Editors eligible for Autopatrol privilege

Per this conversation at the Village pump, run a query very similar to tools:DBQ-87 to find editors eligible for the Autopatrol privilege.

  • Search to find all editors on en.wikipedia who have created 50+ non-redirect articles
  • If the editor already has administrator or autoreviewer privileges, remove from the list
  • If account is less than 6 months old, remove from list
  • If last edit was made more than 30 days ago, remove from list
  • Sort list by number of articles created

Is it possible to run a query like this? Thanks very much. -      Hydroxonium (talk) 00:37, 14 January 2011 (UTC)[reply]

I came up with this; I'm not sure if it works. Tim1357 talk 02:52, 18 January 2011 (UTC)[reply]
Query
SELECT rev_user_text 
FROM   revision 
       JOIN page 
         ON page_id = rev_page 
            AND page_namespace = 0 
            AND page_is_redirect = 0 
       LEFT JOIN user_groups 
         ON ug_group IN ( 'sysop', 'autoreviewer' ) 
            AND rev_user = ug_user 
WHERE  rev_user!=0 
       AND Isnull(ug_user) 
       AND rev_timestamp > Date_format(DATE_SUB(NOW(), INTERVAL 6 MONTH), 
                           '%Y%m%d%H%i%s') 
       AND rev_parent_id = 0 
       AND rev_deleted = 0 
GROUP  BY rev_user 
HAVING COUNT(*) > 50 
LIMIT  1;

Thanks very much, Tim. I'm not familiar with SQL, so I will let others chime in. Thanks again. -      Hydroxonium (talk) 06:13, 18 January 2011 (UTC)[reply]

Can we add "bot" to the ug_group list so it looks like this?
Query
SELECT rev_user_text 
FROM   revision 
       JOIN page 
         ON page_id = rev_page 
            AND page_namespace = 0 
            AND page_is_redirect = 0 
       LEFT JOIN user_groups 
         ON ug_group IN ( 'sysop', 'autoreviewer', 'bot' ) 
            AND rev_user = ug_user 
WHERE  rev_user!=0 
       AND Isnull(ug_user) 
       AND rev_timestamp > Date_format(DATE_SUB(NOW(), INTERVAL 6 MONTH), 
                           '%Y%m%d%H%i%s') 
       AND rev_parent_id = 0 
       AND rev_deleted = 0 
GROUP  BY rev_user 
HAVING COUNT(*) > 50 
LIMIT  1;

I have been running into bots on the list from the jira:DBQ-87 query. I don't think bots are allowed to create articles except in special cases, but I've been seeing them. Thanks again. - Hydroxonium (H3O+) 23:50, 18 January 2011 (UTC)[reply]

Another version...
SELECT
  p2.page_creator,
  usrtmp.rev_timestamp AS first_edit,
  r2.rev_timestamp AS last_edit,
  COUNT(*)
FROM u_mzmcbride_enwiki_page_creators_p.page AS p1
JOIN enwiki_p.page AS p2
ON p1.page_id = p2.page_id
JOIN (SELECT
        user_id,
        user_name,
        user_editcount,
        rev_timestamp,
        GROUP_CONCAT(ug_group) AS groups
      FROM user
      LEFT JOIN user_groups
      ON ug_user = user_id
      JOIN revision
      ON rev_user = user_id
      AND user_id NOT IN (SELECT
                            ug_user
                          FROM user_groups
                          WHERE ug_group IN ('sysop', 'autoreviewer'))
      AND rev_timestamp = (SELECT
                             MIN(rev_timestamp)
                           FROM revision
                           WHERE rev_user = user_id)
      AND rev_timestamp < DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 6 MONTH),'%Y%m%d%H%i%s')
      GROUP BY user_id) AS usrtmp
ON usrtmp.user_name = p1.page_creator
JOIN revision AS r2
ON p2.page_id = r2.rev_page
WHERE p2.page_namespace = 0
AND p2.page_is_redirect = 0
AND r2.rev_timestamp = (SELECT
                           MAX(rev_timestamp)
                         FROM revision
                         WHERE rev_user = usrtmp.user_id)
AND r2.rev_timestamp > DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 1 MONTH),'%Y%m%d%H%i%s')
GROUP BY p1.page_creator
HAVING COUNT(*) > 49
ORDER BY COUNT(*) DESC;

I maintain a database that contains page_id, page_creator pairs. So rather than querying each page, you can just look up the stored info. It's at u_mzmcbride_enwiki_page_creators_p. This query uses that database, but it got interrupted today. This really is too obnoxious to put into one query. It needs to be split out into multiple parts using some sort of sane logic. I may do this tomorrow if I get a chance. --MZMcBride (talk) 07:22, 19 January 2011 (UTC)[reply]

I actually saw that when I was writing the query but didn't use it because the table is indexed so that it is optimized for pageid->creator, instead of the other way around. This means that MySQL has to do a table scan in order to find each page that the user created. Maybe you could create an index on top of the table to help with this? CREATE INDEX user_creations ON u_mzmcbride_enwiki_page_creators_p.page (page_creator); or something would do the trick. Tim1357 talk 04:01, 20 January 2011 (UTC)[reply]
I don't know what you're talking about. The schema looks like this:
CREATE DATABASE u_mzmcbride_enwiki_page_creators_p;
CREATE TABLE page (
page_id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
page_creator varchar(255) binary NOT NULL default ''
);
CREATE INDEX page_creator ON page (page_creator);
I don't really have the time to do this right now, but I think the best method is to do a query to get all the page creators with over fifty article creations (should take about half an hour), then do a Pythonic set comparison with the bots/autopatrollers/sysops and then check last edit/first edit for each user using functions/individual queries. That's how I mapped it out in my head, at least. I probably won't get around to this for a while, though. --MZMcBride (talk) 04:09, 20 January 2011 (UTC)[reply]
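As a sketch of that first step only (untested; the group, account-age and last-edit checks would then happen outside SQL, as described above):

-- Every recorded page creator with 50+ non-redirect article creations,
-- using the auxiliary page_id -> page_creator database mentioned above.
SELECT pc.page_creator, COUNT(*) AS articles_created
FROM u_mzmcbride_enwiki_page_creators_p.page AS pc
JOIN enwiki_p.page AS p ON p.page_id = pc.page_id
WHERE p.page_namespace = 0
  AND p.page_is_redirect = 0
GROUP BY pc.page_creator
HAVING COUNT(*) >= 50;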


Thanks very much for the help, MZMcBride. Yes, please split up the query if that will ease the task. If it ends up being too obnoxious, I would be fine just running the original jira:DBQ-87 query again. Then I can manually go through the list and check for the other things (6 months old, activity in the last 30 days, etc.). That's what I have been doing with the old list. It's just extremely labor intensive. I've spent over 20 hours manually sifting through that old list and I'm starting to go bonkers. Ughhh. Anyway, thanks again. I appreciate the help.- Hydroxonium (H3O+) 08:27, 19 January 2011 (UTC)[reply]
I don't know SQL or anything about the Wikipedia database, but I have been reading up on stuff. So this is likely to be an annoying question that can be ignored, but I'll ask it anyway. Is user_touched a field in the database, and could that be used to check for recent activity? Thanks again for the help. - Hydroxonium (H3O+) 08:58, 19 January 2011 (UTC)[reply]
If it's not too much trouble, could we exclude users whose last creation was more than, say, 90 days ago? If it's too much faff, then never mind—it's easy enough to check manually. HJ Mitchell | Penny for your thoughts? 01:33, 20 January 2011 (UTC)[reply]
HJ Mitchell, that won't slow it down much. We could just add AND MAX(rev_timestamp)>DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 90 DAY),'%Y%m%d%H%i%s') to the HAVING statement. Tim1357 talk 04:17, 20 January 2011 (UTC)[reply]
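For reference, the tail of the earlier query would then read roughly as follows (untested):

GROUP BY rev_user
HAVING COUNT(*) > 50
       AND MAX(rev_timestamp) > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 90 DAY), '%Y%m%d%H%i%s')
LIMIT 1;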
MZMcBride: I'll try to run this on one of the slow servers to see if I can get any results. Tim1357 talk 04:17, 20 January 2011 (UTC)[reply]