User talk:West.andrew.g/Popular pages/Archive 2


Weird redirect issue for WP:5000

I've just noticed that WP:5000 (or Wikipedia:5000, which behaves identically) now redirects to Wikipedia:List of Wikipedians by number of edits, but it sure used to redirect to User:West.andrew.g/Popular pages. The strange thing is, that when you follow the redirect, and then go back by clicking on (Redirected from Wikipedia:5000), voila... You see the page which says that no, the redirect is actually still pointing to User:West.andrew.g/Popular pages. Could anyone fix this incoherence? Or redirect the problem to a proper place? --Kubanczyk (talk) 19:17, 2 August 2013 (UTC)

I did actually make a recent edit. Maybe that explains it? Best. Biosthmors (talk) 19:20, 2 August 2013 (UTC)
Is it working for you now Kubanczyk? Best. Biosthmors (talk) 19:40, 2 August 2013 (UTC)
Done. Works perfectly, muchas gracias. --Kubanczyk (talk) 07:19, 3 August 2013 (UTC)


Hi, I just had two questions:

  1. What page/article is "[]" (#2 on your list)?
  2. Why are so many top articles not actually articles? They are red links; there are no articles, so I don't know how nonexistent pages can be counted (or why they would chart so high).

Thanks for any answers you can provide. I used to track trends on Twitter so I'm always curious to see what topics are drawing interest and I'm glad I came across your page. Newjerseyliz (talk) 14:08, 13 August 2013 (UTC)

I can partially answer #2: It seems there is a Polish bot that is spamming Wikipedia for views, but it's poorly programmed and all the targets contain typographical errors, so they show up as redlinks on the list. As to what to do about it, no clue, really. Serendipodous 15:05, 13 August 2013 (UTC)
Sorry for the late response. (1) I would bin this with your second question... (2) Whenever something is requested on the server, whether it exists or not, this fact is recorded in the statistics aggregation. The "[]" case is likely a syntax error in the placement of wikilinks or their routing. The other "non-pages" exist prominently either due to intentional (i.e., topic spamming) or non-intentional (e.g., a mis-configured content scraper that just keeps re-attempting to download a page) reasons. However, we do assume these "red links" are the result of automated/non-human views. In some cases it can be very tricky to determine whether a popularity spike is driven broadly by society, or just a single person with a bot. It is not difficult to game these trends, though it's questionable if that's a worthy use of a miscreant's time. The weekly WP:Top25Report attempts to sort through this cruft and produce a witty list of what is actually trending. Thanks, West.andrew.g (talk) 14:49, 16 August 2013 (UTC)
Thank you, Serendipod and West.andrew.g, for your answers to my questions. They were very helpful.
I imagine with a project that is on the scale of Wikipedia that these charts can only be compiled by bot and that it would be laborious to go through and delete articles that are present due to gaming the system. And, while some of the redlinks and "[ ]" seem like errors, this is not always obvious.
Final question: Is it a habit to go through this list and see what pages people are looking for which don't yet exist and start a request for them at AfC? I'm not sure how much work this would involve but it could be that readers are looking for pages that don't yet exist and which should be created. Thanks again for your prompt and polite replies! NewJerseyLiz Let's Talk 14:13, 17 August 2013 (UTC)
There may have once been value in doing that, but now? With the sheer number of redlinks flooding in thanks to that Eastern European automaton, searching for a properly trending redlink is almost impossible. And let's be honest here; people are ignorant. If something is trending on Wikipedia, it's not because it's some hidden secret that the cognoscenti had been trading feverishly amongst themselves but that the rest of the internet has suddenly become aware of- it's because it's something that was already popular and well-known, and if it's popular and well-known, there is probably a Wiki page about it. Not a particularly GOOD one necessarily, but a Wiki page nonetheless. Whenever some obscure artist or small business owner suddenly gets 300,000 views, it's safer to chalk it up to spam than to genuine interest. Serendipodous 14:24, 17 August 2013 (UTC)
User:Newjerseyliz, I turned a few things blue that were red at WP:Topred this past week. Check that list out. And see the active request at the talk page. Yes there is junk there. But I find it useful! You might find it useful to submit AfC requests, but I don't know that process. User:Serendipodous, that was a bit more unhelpful of a reply than it should have been, in my opinion. So Newjerseyliz, don't mind Serendipodous. But to be fair, I do acknowledge that the red links on the WP:5000 lately rarely if ever need creating. Best wishes! Biosthmors (talk) 15:58, 17 August 2013 (UTC)
Thanks for that link, Biosthmors. As for Serendipodous, he has helped me before and having trudged through tons of data on Twitter trends, coming up with Top 100 lists for years 2009-2011, I know what it's like to filter out the gold from a lot of mud. It's too bad that spambots clutter up the statistics, I'm not sure what they really get out of this behavior but then, hey, I can live without knowing! Thanks to you both! NewJerseyLiz Let's Talk 17:08, 18 August 2013 (UTC)
I am sorry for coming across as grouchy; I thought you were referring to the redlinks in the top 25. Those are useless. But yes, I agree there is value in WP:TOPRED, though one would have to be careful to judge whether 1000 views a week constituted a genuine desire on the part of the public. Serendipodous 17:16, 18 August 2013 (UTC)
And now I doubt I struck the best tone! Thanks for your contributions Serendipodous. Biosthmors (talk) 19:43, 18 August 2013 (UTC)

"The third column is the number of page views."

- "for the week" should be spelled out. I wasn't clear and had to check one. Otherwise very useful, thanks. A calculated column with the daily average would be helpful too, as daily views are the typical measure we are used to seeing. Johnbod (talk) 13:49, 8 September 2013 (UTC)

If changes should be made, be bold in doing so! While the actual statistics are updated automatically, the header section is a static transclusion and therefore changes made by normal editors will not be overwritten at the next update. Thanks, West.andrew.g (talk) 14:37, 10 September 2013 (UTC)

Article class as a sortable field

Andrew, I was thinking this table would be more useful if people could sort by article class:

  • have you thought of moving article class to a dedicated, sortable field? Or is there any reason not to? DarTar (talk) 21:41, 27 September 2013 (UTC)
    • I have done this and the report has been regenerated. To maximize utility I gave each classification its own column and re-ordered the legend accordingly/respectively. This is a non-trivial sort operation even on my quite powerful machine. West.andrew.g (talk) 17:00, 27 November 2013 (UTC)
      • definitely non-trivial -- it's prone to crashing now on smaller machines, unfortunately. -- phoebe / (talk to me) 19:36, 8 December 2013 (UTC)
  • is a version of this table available (on labs?) as a simple API? DarTar (talk) 21:41, 27 September 2013 (UTC)
    • There is not, but I champion data sharing if there is a use-case and someone wants to facilitate the labs side of things. West.andrew.g (talk) 17:00, 27 November 2013 (UTC)
  • there's a small formatting issue with the stats section at the bottom of the page. DarTar (talk) 21:41, 27 September 2013 (UTC)
    • Is something wrong, or is it just not pretty? I see that the dividing "equals" signs are being eaten by section parsing. I've made a minor change to report generation. West.andrew.g (talk) 18:41, 27 November 2013 (UTC)

That would be useful! Additionally, I was trying to figure out why some articles that do have class ratings don't have an icon displayed. -- phoebe / (talk to me) 18:27, 26 November 2013 (UTC)

Can you provide any quick examples? Maybe I need to update the regexps used to detect these memberships. West.andrew.g (talk) 08:48, 27 November 2013 (UTC)
    • Well, I had some in hand when I posted this, but it looks like in this batch all of the unrated ones are redirects or truly unrated articles. So I'm not sure -- either I didn't catch that they were all redirects the first time, or something changed. Thanks again for doing this! cheers, -- phoebe / (talk to me) 19:36, 8 December 2013 (UTC)

holiday in [[Canc�n]]

Curious about the odd character that takes the place of "ú", I looked at one of the logs [1] (large file) and found this (the first column is the site, where "en" is the English Wikipedia; the second column is the page title; the third column is the number of requests; and the fourth column is the bytes served):

en Canc%C3%BAn 75 2499038
en Canc%C3%BAn%2C_Mexico 1 30782
en Canc%C3%BAn%2C_Quintana_Roo 2 0
en Canc%C3%BAn,_Quintana_Roo 1 30782
en Canc%C3%BAn_International_Airport 7 219001
en Canc%FAn 689 0
en Canc\xC3\xBAn 4 126709
[...] en Cancun 2 61560
en Cancun,_Mexico 3 92352
en Cancun_International_Airport 3 369228
en Cancun_Underwater_Museum 2 19440
en Cancun_airport 1 31341
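
The four-column layout described above (site, page title, request count, bytes served) can be read with a few lines of Python. This is only a minimal sketch: the `LogRecord` type and `parse_line` helper are hypothetical names introduced for illustration, and the sample lines are copied from the excerpt above rather than read from a real log file.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    """One line of the hourly log (hypothetical name, for illustration only)."""
    site: str
    title: str
    requests: int
    bytes_served: int


def parse_line(line: str) -> LogRecord:
    # The first field is the site; the last two fields are numeric.
    # Splitting from both ends keeps the title intact whatever it contains.
    site, rest = line.split(" ", 1)
    title, requests, size = rest.rsplit(" ", 2)
    return LogRecord(site, title, int(requests), int(size))


# Sample lines copied from the excerpt above.
sample = [
    "en Canc%C3%BAn 75 2499038",
    "en Canc%FAn 689 0",
    "en Canc\\xC3\\xBAn 4 126709",
    "en Cancun 2 61560",
]

records = [parse_line(line) for line in sample]
busiest = max(records, key=lambda r: r.requests)
print(busiest.title, busiest.requests)
```

On these four lines, the record with the most requests is the `Canc%FAn` spelling, which matches the observation that follows.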

The requests for Canc%FAn seem to be what make this "popular". The wiki only partly supports that encoding; putting it into a URL takes me to the intended article [2] but an attempt at making a wiki-link looks like this: [[Canc%FAn]]. Looking further in the log file, I noticed that there were no requests to other Wikipedias, or for other articles or files, with the word encoded as "Canc%FAn". Requests with the "Canc%C3%BAn" encoding had much more variation:

commons.m File:Aeropuerto_de_Canc%C3%BAn.JPG 1 11518
commons.m File:Canc%C3%BAn,_Quintana_Roo_Collage.jpg 5 79278
commons.m File:Hard_Rock_Cafe_Canc%C3%BAn.JPG 1 9196
commons.m File:Hotel_Bah%C3%ADa_Pr%C3%ADncipe-Chacumal-Estrada_federal_307_Canc%C3%BAn-Chetumal-1.jpg 1 0
de Canc%C3%BAn 3 55545
de UN-Klimakonferenz_in_Canc%C3%BAn 1 46264
en Amante_bandido_-_Miguel_Bos%C3%A9_en_Canc%C3%BAn_(acercamiento_con_binocular) 1 7153
en Aut%C3%B3dromo_de_Canc%C3%BAn 1 18823
en Canc%C3%BAn 75 2499038
en Canc%C3%BAn%2C_Mexico 1 30782
en Canc%C3%BAn%2C_Quintana_Roo 2 0
en Canc%C3%BAn,_Quintana_Roo 1 30782
en Canc%C3%BAn_International_Airport 7 219001
en Category:People_from_Canc%C3%BAn 4 32788
en File:Canc%C3%BAn%2C_Quintana_Roo_Collage.jpg 6 57540
en File:Canc%C3%BAn,_Quintana_Roo_Collage.jpg 5 47950
en Talk:Canc%C3%BAn 1 24111
es Aeropuerto_Internacional_de_Canc%C3%BAn 9 385164
es Canc%C3%BAn 51 3217033
es Estadio_Canc%C3%BAn_86 1 0
es Pioneros_de_Canc%C3%BAn 1 10636
eu Canc%C3%BAn 1 13625
fr Canc%C3%BAn 3 48714
fr Les_Marseillais_%C3%A0_Canc%C3%BAn 7 117465
hr Canc%C3%BAn 1 11223
it Aeroporto_Internazionale_di_Canc%C3%BAn 1 17328
it Canc%C3%BAn 1 17144
ko eu:Canc%C3%BAn 1 20
mr %E0%A4%9A%E0%A4%BF%E0%A4%A4%E0%A5%8D%E0%A4%B0:Sala_embarque_aeropuerto_de_Canc%C3%BAn.JPG 1 12132
pl Canc%C3%BAn 5 186749
pt Canc%C3%BAn 16 396723
tr Canc%C3%BAn 1 12754

I noticed especially that on the Spanish Wikipedia, there were 51 requests for "Canc%C3%BAn" but none for "Canc%FAn" whereas on the English Wikipedia, there were 4 requests for "Canc\xC3\xBAn" and 689 for "Canc%FAn". —rybec 22:01, 25 December 2013 (UTC)
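
The difference between the two spellings can be illustrated with Python's standard `urllib.parse` module. `%C3%BA` is the UTF-8 percent-encoding of "ú", while `%FA` is the single byte that ISO-8859-1 (Latin-1) uses for the same letter. The `classify` helper below is a hypothetical name used only for this sketch. That the odd character in the heading comes from decoding `%FA` as UTF-8 is a plausible assumption, not something the logs confirm.

```python
from urllib.parse import unquote, unquote_to_bytes


def classify(title: str) -> str:
    """Report whether a percent-encoded title decodes cleanly as UTF-8."""
    raw = unquote_to_bytes(title)
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "not utf-8 (possibly latin-1)"


# UTF-8 spelling: decodes to the intended title.
print(unquote("Canc%C3%BAn"))

# Latin-1 spelling: decodes correctly only if Latin-1 is requested.
print(unquote("Canc%FAn", encoding="latin-1"))

# Decoded as UTF-8 (the default), the lone 0xFA byte is invalid and is
# replaced by U+FFFD, the "replacement character".
print(repr(unquote("Canc%FAn")))

for title in ("Canc%C3%BAn", "Canc%FAn", "Cancun"):
    print(title, "->", classify(title))
```

A check like `classify` could be used to separate the two families of requests before counting them, though it cannot say who or what generated the Latin-1 requests.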

Interesting: -- West.andrew.g (talk) 19:26, 26 December 2013 (UTC)

similar graphs

The Web site has graphs of the traffic. For last week's list, I noticed that many of the most-requested articles about food, ecology, politics and geography had similar graphs (for Climatic Research Unit email controversy and two others, the similarity to all the others begins after a drastic increase in traffic).

rybec 10:32, 29 December 2013 (UTC)

I deliberately exclude the climate change articles' views from my reports, because I assume they are artificially generated; the fact that they follow similar patterns would appear to support that. Serendipodous 11:03, 29 December 2013 (UTC)

(edit conflict) The ones that didn't match were mainly about current events or entertainment (my computer mangled some of the diacritical marks):

I think the traffic to articles in the first list is mostly automated. On 1 November, noticing a massive number of requests for Harlan Watson, I wrote the article. Several of the sources I found call the man Harlan L. Watson. Since the beginning of November, there have been over a million requests for Harlan Watson, but only 24 for Harlan L. Watson. Also striking is the fact that no one else has edited the article or its talk page. Along the same lines, I notice that:

I don't have a terrible amount to contribute here, but I will add: (a) Don't underestimate the circadian/weekly patterns in your first set of statistics. I don't see much else remarkable going on in those graphs besides that. (b) These weird spellings getting more traffic than the base article are most certainly indicative of bot traffic. This is another reason to bug the analytics team to produce some aggregate metrics for us. I am beginning to suspect that WP:5000 and other sources might be dramatically over-reporting/suggesting direct human traffic to articles. West.andrew.g (talk) 23:01, 29 December 2013 (UTC)
I had been looking more at the secular trend than at the weekly cycles. On a weekly scale these still look odd: I just looked at the November 2012 graph for "Denmark" [410] and it showed a regular weekly variation. That's gone in this November's [411] (with 3 times the traffic). "Greenhouse effect" shows weekly cycles in both Novembers, with traffic declining by 19%: [412] [413]; "Greenhouse gas" loses its weekly pattern and takes the same pattern as "Denmark" [414] [415], with traffic increasing to 10.5 times what it was.
Before, I had opened the graphs in different tabs in a browser, then switched between tabs. In the first group, the traffic increased steadily from 10 October until 16 November, then gradually declined through December (except for Copenhagen_treaty, Vegetarian_cuisine and Climatic_Research_Unit_email_controversy, for which the high traffic began on 7 November, then followed the same curve as the others in the eco-food group).
Some articles in the first group are getting far more requests than they did in 2012. Other, comparable articles don't show such drastic changes.

rybec 04:40, 30 December 2013 (UTC)

Significant errors present in next week's data

FYI, the WMF statistical backend malfunctioned for nearly 35 hours over 1/5 and 1/6. Notice the empty (4k) hourly files in the usual location. I don't know if this is something they can recover, but if not, it will certainly have great bearing on our next WP:5000 and its comparisons to previous editions of this list. West.andrew.g (talk) 15:30, 8 January 2014 (UTC)

Yearly summary in production

@The ed17: @Serendipodous: @Milowent: @Yaris678: -- Code is currently running to spit out a 2013 statistical summary equivalent in format to WP:5000. This is no trivial task, and I expect it to take on the order of a couple days to do the massive database join. Once it is done, I am thinking it will be a valuable and fun resource. I can also spin off a couple of tables for the "biggest hours" or "biggest days" for certain events/articles. Framed with discussion this should make a nice Signpost article, and given the success of our last attempt, I'd again like to see this pushed to Reddit, Slashdot and all the other outlets we can think of. Who is on board?

In related news, I'd like to combine these statistics, our previous discussion/analysis in the Signpost, and some novel processing towards an academic publication (a conference deadline friendly to this topic is coming in late February). I'd like to invite those who I interact with regularly here to be my co-authors in that effort. While the Signpost is great for Wikipedia folks, it would be nice to reach out to the larger web research community and perhaps get others interested in the data. West.andrew.g (talk) 18:34, 2 January 2014 (UTC)

Hi. Andrew. Um, I think that what you're doing is great; the top 5000 of the year would certainly be worth a look, but the Foundation already published its annual top 100. Hope that doesn't deflate your sails too much. Serendipodous 18:58, 2 January 2014 (UTC)
I thought I remembered seeing that a week or two back, so it couldn't have truly captured *all* of 2013 (although the impact is probably minimal). Regardless, I'll spin up the top 5000 (and maybe more) and some related charts. If you'd like to do your usual snarky take on the top of the list, that might also be well received. Moreso than the raw stats, I want to get some discussion about catalysts into academically published form. West.andrew.g (talk) 19:16, 2 January 2014 (UTC)
My draft is already at the Signpost. The big problem with determining catalysts is figuring out which topics are due to human interest, which are due to error, and which are due to automated bots. We simply don't have enough tools yet. Serendipodous 20:37, 2 January 2014 (UTC)
It would be interesting to see where the nonexistent articles rank, perhaps mixed in with the ones that exist. —rybec 21:18, 2 January 2014 (UTC)
Yep, that was my intention. Once I've distilled this comprehensive table, it will be a piece of cake (at least in terms of raw code; time might be another matter), along with possibly some other data questions. Thanks, West.andrew.g (talk) 23:21, 2 January 2014 (UTC)
  • I'd be interested in pitching in. I am interested to see how one year's data of the WP:5000 has played out. Even if the Foundation has done its top 100 already, we could create something that is much more interesting. Including sublists for top movies, top TV shows, etc., that might be of interest. - it will require human effort to compile, but something like comparing a list of the top 25 movies vs. box office $$ would be interesting. Similar for TV, is Breaking Bad really the most popular TV show in the English-speaking world? Or just among wikipedia readers? And @Serendipodous:, don't get too down about the problem of ferreting out bot-influenced articles, I'd say you've done a good job of finding suspicious entries, e.g., your draft list is sound for removing G-force, these are things we wrestled over early that you now can deal with easily.--Milowenthasspoken 23:55, 2 January 2014 (UTC)

Is the update late this week? Hope you're not too overloaded. Serendipodous 16:47, 5 January 2014 (UTC)

Yes, there is a delay. I'm not overloaded, but the CPU that does all this work is. It should appear in the next 10 hours or so I estimate. West.andrew.g (talk) 00:57, 6 January 2014 (UTC)
And I need to backtrack on something I said to Rybec above. I will not be computing a full redlinks report. The computational struggles surrounding it are why I have not published any report yet. The long tail of non-existent pages is incredibly diverse across time. More than 50 million articles will be requested in a week's table (with only ~4.5 million actually existing). I couldn't quite compute this figure for the year, but let's say it's probably on the order of 250+ million unique articles requested. When you have 250 million of these and need to join the stats data from 52 weekly tables, things quickly blow up on my infrastructure. I plan to bound this to the 4.5 million existing articles to help with this. West.andrew.g (talk) 19:50, 9 January 2014 (UTC)
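
The idea of bounding the join to existing articles can be sketched as follows. This is not the code actually used for the report: the function name, the in-memory dictionaries standing in for the weekly tables, and the tiny sample values are all assumptions made for illustration. The point is only that filtering each weekly table against the set of existing titles keeps the accumulator no larger than that set, however long the tail of nonexistent titles is.

```python
from collections import Counter
from typing import Iterable, Mapping, Set


def yearly_totals(weekly_tables: Iterable[Mapping[str, int]],
                  existing: Set[str]) -> Counter:
    """Sum weekly view counts, keeping only titles known to exist."""
    totals: Counter = Counter()
    for table in weekly_tables:
        for title, views in table.items():
            # Skip the long tail of nonexistent titles so the accumulator
            # never grows beyond len(existing) entries.
            if title in existing:
                totals[title] += views
    return totals


# Made-up toy data: two "weeks" and a two-title set of existing articles.
existing_titles = {"Denmark", "Greenhouse_gas"}
weeks = [
    {"Denmark": 100, "Greenhouse_gas": 40, "Typo_redlink_1": 9000},
    {"Denmark": 120, "Another_redlink": 7000},
]

totals = yearly_totals(weeks, existing_titles)
print(totals.most_common())
```

With real data the weekly tables would presumably be streamed from disk or a database rather than held in memory, but the filtering step would stay the same.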

Highest traffic events of 2013 (by "article hour")

Below are the busiest "article hours" in 2013. That is, those articles receiving the most traffic in a one hour period. Only the most popular hour for a title is shown, and I've excluded the main page. I've pasted the first 500 entries in raw form. Recall that these dates are in UTC time. If someone would like to wikify and extend this table, perhaps we could try to publicize a bit?

ARTICLE                  | UTC DATE          | VIEWS     | REASON
[[Jorge_Bergoglio]]      | March 13, 2013    | 1,460,586 | Papal ascension
[[Shakuntala_Devi]]      | November 4, 2013  | 766,256   | Google Doodle
[[Paul_Walker]]          | December 1, 2013  | 752,770   | Death
[[Grace_Hopper]]         | December 9, 2013  | 621,694   | Google Doodle
[[Nelson_Mandela]]       | December 5, 2013  | 484,966   | Death
[[Jodie_Foster]]         | January 14, 2013  | 451,270   | Came out at Golden Globes
[[Beyonc%C3%A9_Knowles]] | February 4, 2013  | 378,923   | Super Bowl halftime
[[Nicolaus_Copernicus]]  | February 19, 2013 | 336,836   | Google Doodle
[[Seth_MacFarlane]]      | February 25, 2013 | 320,999   | Hosted the Oscars
[[Daniel_Day-Lewis]]     | February 25, 2013 | 318,839   | Oscars
[[Society_of_Jesus]]     | March 13, 2013    | 287,568   | Papal ascension
[[Mindy_McCready]]       | February 18, 2013 | 282,679   | Death
[[Hermann_Rorschach]]    | November 8, 2013  | 276,072   | Google Doodle
[[Edith_Head]]           | October 28, 2013  | 263,915   | Google Doodle
[[Raymond_Loewy]]        | November 5, 2013  | 258,301   | Google Doodle
[[Margaret_Thatcher]]    | April 8, 2013     | 252,906   | Death
[[Pope_Francis]]         | March 13, 2013    | 248,753   | Papal ascension
[[Peter_Capaldi]]        | August 4, 2013    | 244,667   | Announced as next Dr. Who

Thanks, West.andrew.g (talk) 20:30, 9 January 2014 (UTC)
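
The selection rule described above (keep only the most popular hour per title, exclude the main page, list the top entries) could be sketched like this. The function name, the `(title, hour, views)` row layout, the `Main_Page` title string, and the sample rows are assumptions for illustration. They are not the code or data behind the table.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, int]  # (title, UTC hour label, views in that hour)


def busiest_article_hours(rows: Iterable[Row],
                          exclude: Tuple[str, ...] = ("Main_Page",),
                          limit: int = 500) -> List[Row]:
    """Return each title's single busiest hour, ranked by views."""
    best: Dict[str, Tuple[str, int]] = {}
    for title, hour, views in rows:
        if title in exclude:
            continue
        # Keep only the highest-traffic hour seen so far for this title.
        if title not in best or views > best[title][1]:
            best[title] = (hour, views)
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return [(title, hour, views) for title, (hour, views) in ranked[:limit]]


# Made-up toy rows; hour labels and counts are illustrative only.
sample_rows = [
    ("Main_Page", "2013-03-13 19:00", 900),
    ("Article_A", "2013-03-13 19:00", 500),
    ("Article_A", "2013-03-13 20:00", 300),
    ("Article_B", "2013-12-01 05:00", 700),
]

for row in busiest_article_hours(sample_rows):
    print(row)
```

Because only one hour per title is kept, an article with several very busy hours still appears once, which is the behaviour described for the table.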

The 10,000 most popular articles in 2013

After many computer cycles, the list has been generated. I did the top 10k with quality annotations. Give it a while to load, as there is a ton of table processing that has to go on for that page to generate:

The top 10,000 for 2013 -- I would appreciate if people could re-post to whatever talk pages or venues might find this interesting. Thanks, West.andrew.g (talk) 17:02, 13 January 2014 (UTC)