User:Killiondude/stats
Frequently Asked Questions
This page serves to document frequently asked questions regarding Henrik's Wikipedia article traffic statistics tool.
Is it case sensitive?
No.
Are redirects included in the data for a specific article?
No. Each redirect's hit statistics must be looked up separately.
How can I find out the top viewed pages for any given project?
View a statistics page for any article on the desired project. Then change the URL manually to replace the date and article name with the term top. Example: http://stats.grok.se/en/200912/Special:Search → http://stats.grok.se/en/top
Note that this information is not updated on a regular schedule, and it was not updated at all for a long time (from 2010 until April 2013). The update is performed (at least somewhat) manually by Henrik and probably requires considerable resources.
As an alternative, try tools:~johang/wikitrends or tools:~johang/2012.html.
How do I see stats for this month? A link doesn't show up!
You can edit the URL manually to use the current year followed by this month's numerical code (January = 01, February = 02, and so on). For example, in http://stats.grok.se/en/201004/Tree, 2010 is the year and 04 is the month of April.
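The URL pattern described above can be sketched in Python. This is only an illustration: the helper name `month_url` is hypothetical, and the handling of spaces and percent-encoding in titles is an assumption rather than something the tool documents.

```python
from urllib.parse import quote

BASE = "http://stats.grok.se"

def month_url(project: str, year: int, month: int, title: str) -> str:
    """Build a monthly stats URL of the form <base>/<project>/<YYYYMM>/<title>."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # Assumption: spaces become underscores, as in the example URLs on this page.
    page = quote(title.replace(" ", "_"), safe=":/_()")
    # Zero-pad the month so that April becomes "04".
    return f"{BASE}/{project}/{year:04d}{month:02d}/{page}"

print(month_url("en", 2010, 4, "Tree"))  # http://stats.grok.se/en/201004/Tree
```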
How do I see stats for the past 30 days?
Use the format http://stats.grok.se/en/latest/Tree, which always shows the most recent 30 days. There are also latest30, latest60 and latest90, which are now linked in the interface as well.
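As a small sketch of the keywords listed above, the following Python helper picks the right path segment for a rolling window. The function name `latest_url` and the `days` parameter are hypothetical; only `latest`, `latest30`, `latest60` and `latest90` come from this page.

```python
ALLOWED_WINDOWS = (30, 60, 90)

def latest_url(project: str, title: str, days=None) -> str:
    """Build a rolling-window stats URL using the latest* keywords."""
    if days is None:
        segment = "latest"  # plain "latest" covers the most recent 30 days
    elif days in ALLOWED_WINDOWS:
        segment = f"latest{days}"
    else:
        raise ValueError(f"days must be one of {ALLOWED_WINDOWS} or None")
    return f"http://stats.grok.se/{project}/{segment}/{title}"

print(latest_url("en", "Tree"))      # http://stats.grok.se/en/latest/Tree
print(latest_url("en", "Tree", 90))  # http://stats.grok.se/en/latest90/Tree
```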
How often are the stats updated?
Once per day, usually soon after 0:00 UTC.
Where is the data previous to October 2009 located?
There is a set uploaded to the Internet Archive located here.
Is the pageview data available in any other data format?
Yes.
You can see JSON-formatted data by inserting /json/ at the start of the URL path, before the project code, like so: http://stats.grok.se/json/en/200910/Michael_Jackson.
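A minimal Python sketch of this rewrite, turning a normal stats URL into its JSON counterpart. The helper name `to_json_url` is hypothetical, and the structure of the returned JSON is not described on this page, so the sketch stops at building the URL.

```python
from urllib.parse import urlsplit, urlunsplit

def to_json_url(url: str) -> str:
    """Insert /json at the start of the path of a stats.grok.se URL."""
    parts = urlsplit(url)
    if parts.path.startswith("/json/"):
        return url  # already a JSON URL, leave it unchanged
    return urlunsplit(parts._replace(path="/json" + parts.path))

print(to_json_url("http://stats.grok.se/en/200910/Michael_Jackson"))
# http://stats.grok.se/json/en/200910/Michael_Jackson
```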
Raw data is available:
- in the original (source) data at http://dumps.wikimedia.org/other/pagecounts-raw/ (announcement of the move from domas' personal server to Wikimedia's database dumps),
- repackaged and compressed by Erik Zachte at http://dumps.wikimedia.org/other/pagecounts-ez/ ,
- at archive.org (see this list) (announcement).
I liked the old design better!
You can access it by writing stats-classic instead of stats in the host name: http://stats-classic.grok.se/en/200910/Michael_Jackson
What do these columns represent in the original data sets?
The format of these files is as follows: <project> <page name> <access count number> <transfer size in bytes>
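The four-column format above can be parsed with a few lines of Python. This is a sketch under assumptions: the names `PageCount` and `parse_line` are hypothetical, fields are assumed to be separated by single spaces, and the sample line is invented for illustration rather than taken from a real file.

```python
from typing import NamedTuple

class PageCount(NamedTuple):
    project: str     # project code, e.g. "en"
    page: str        # page name
    views: int       # access count number
    bytes_sent: int  # transfer size in bytes

def parse_line(line: str) -> PageCount:
    """Parse one '<project> <page name> <count> <bytes>' row."""
    project, rest = line.rstrip("\n").split(" ", 1)
    # Split the two numeric columns from the right, so that an unexpected
    # space inside the page name would not shift the numbers.
    page, views, size = rest.rsplit(" ", 2)
    return PageCount(project, page, int(views), int(size))

# Invented sample row, for illustration only.
print(parse_line("en Main_Page 42 1000\n"))
```

A malformed row raises `ValueError`, which a caller processing a whole file may want to catch and skip.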
Are sister projects included?
Starting with 20080517-100000, projects other than Wikipedia are also included in the raw data, but they are not visible in the interface. The code to use in the URL is the same as in the raw data: <subdomain>.[bdnqsvm], www.w for the MediaWiki wiki, .m for all wikimedia.org subdomains, voy for Wikivoyage.
Except for Meta (e.g. http://stats.grok.se/meta/201005/Main_Page), Commons (e.g. http://stats.grok.se/commons/201005/Main_Page) and other projects added later (almost all of them), the page title and the back link to the page in question may be wrong. This does not mean that the stats are wrong too, provided your project code is correct.
- Note: More detailed information about the format of URLs is available here: http://www.archive.org/details/wikipedia_visitor_stats_200712 and http://dumps.wikimedia.org/other/pagecounts-raw/
Why are figures so low?
«A significant percentage (about a third) of pageviews weren't being logged due to packet loss on the aggregating server.» ([Wiki-research-l] Pageview data lost to packet loss) The problem possibly started in November 2009 and was corrected in late July 2010.[1]
In December 2011 there was also a loss of data on Wikimedia's part.
For some time in 2013, Google's indexing was inconsistent, linking to HTTPS in some cases and to the mobile site in others, so pageview counts appeared erratic.
HTTPS visits have sometimes been overcounted, but all known mistakes have been retroactively fixed.
Are they real pageviews?
Page views are not unique visitors, but the raw data is actually not about "views" either: it's just "pages loaded", when accessed at the normal URL like https://en.wikipedia.org/wiki/User:Killiondude/stats but not [2] etc.
It's not sampled data and it's not checked for outliers. It contains much impure data, for instance bots loading a page continuously for whatever reason, any careless crawler not using the API, etc.
Moreover, it doesn't include requests to the mobile site, which is expected to serve about half of the pageviews at some point in 2015.
A highly aggregated graph of all requests to WMF servers, whose exact meaning is unknown, is also available at https://gdash.wikimedia.org/dashboards/reqsum/
What about mobile?
As said above, mobile pageviews are not included in stats.grok.se.
The raw data docs point out the existence of "*.mw" key(s) for such views. However, if you download one of those raw pagecount files and 'grep' for that string, you'll find it appears exactly once, where the aggregate number of mobile views over all articles is counted (i.e., one mega-aggregate number, not the several million article-granularity ones we would expect/like).
Since 2014, a new raw data stream is available which should address this and other issues: pagecounts-all-sites.
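The 'grep' check described above can be approximated in Python by keeping only rows whose project code ends in ".mw". This is a sketch: the function name `mobile_rows` is hypothetical and the sample rows are invented for illustration, not copied from a real pagecount file.

```python
def mobile_rows(lines):
    """Return (project, page, count) for rows whose project code ends in '.mw'."""
    found = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip rows that do not have the four expected columns
        project, page, count, _size = fields
        if project.endswith(".mw"):
            found.append((project, page, int(count)))
    return found

# Invented sample rows, for illustration only (not real data).
sample = [
    "en Tree 10 1000",
    "en.mw en 5000 0",
    "de Baum 3 300",
]
print(mobile_rows(sample))  # [('en.mw', 'en', 5000)]
```

For a real file, the same function can be fed an open file object, since it only iterates over lines.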
Why did Special:Random and others disappear?
Since October 2014, "visits" which HTTP-redirect somewhere else are no longer counted, to avoid double counting etc. This affects, for instance, Special:Random, Special:MyPage and Special:MyLanguage. See bugzilla:71790 for details.
The only alternative to TB of raw data?
Finally, as of 2016, the Wikimedia Foundation provides a pageviews API which can be queried for pageviews data on individual wikis or pages, with virtually unlimited capacity and no need to download or parse data dumps.
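As a rough illustration of querying the pageviews API, the sketch below only builds a per-article request URL and does not contact the network. The endpoint layout and the default values are not given on this page; they reflect the Wikimedia REST API as commonly documented and should be checked against the official API documentation. The helper name `per_article_url` is hypothetical.

```python
from urllib.parse import quote

# Assumption: per-article endpoint of the Wikimedia REST API (not stated on this page).
API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def per_article_url(project, article, start, end,
                    access="all-access", agent="all-agents",
                    granularity="daily"):
    """Build a per-article pageviews URL; start and end are YYYYMMDD strings."""
    # Titles use underscores for spaces and must be percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([API_ROOT, project, access, agent, title,
                     granularity, start, end])

print(per_article_url("en.wikipedia", "Michael Jackson", "20160101", "20160131"))
```

The resulting URL can be fetched with any HTTP client; the response is JSON, so no dump parsing is needed.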
There has been a lot of talk over the years (at least since 2010) on how to provide an alternative to stats.grok.se that would make it possible for other tools to query pageviews data without doing the hard processing of the raw files. See Magnus' Points of view, February 2014.
As of March 2014, a new service that also offers machine-readable output is available for the English Wikipedia data: http://www.wikipediatrends.com/
Where is the code and where does it run?
The code is at https://github.com/abelsson/stats.grok.se
Since 2014-04-12, most traffic has been served from a new, faster machine, as the Wikimedia Foundation finally helped Henrik cover the cost of buying it. For many years, the site ran on Henrik's 2010 machine with a ~2.5GHz processor, 12 GB RAM and 8 TB disk.
What about referrers and locations?
It's not possible to filter visit statistics by referrer or location (e.g. country): for privacy reasons, the Wikimedia Foundation does not regularly publish such data.
Geolocation information is however used to publish regular and official per-country visits and edits statistics and more geolocation work is ongoing.
An English Wikipedia clickstream was also published, using private referrer information filtered in various ways.
Are there known dates for which complete sets have not been compiled although the data seems to be available?
For English Wikipedia, the following dates appear to be compilable although they have not been compiled:
- 1/31/08
- 2/28/08
- 3/1/08
- 6/1/08
- 6/2/08
- 7/12-31/08
- 11/15/09
- 2/23/10
- 6/26/10*
- 9/2/2011
- 10/20/11
- 12/31/13?
- from 01/21/2016 to date
Compare Erik Zachte's list of dates which should not be used.
See also
- Emw's version is a visual tool written by Emw, based off of Henrik's data.
- WikiProject Popular pages lists, written by Mr.Z-man. This tool permits viewing statistics from any one month to another, but does not show day-by-day statistics. It does not go as far back in time as stats.grok.se.
- Trending Topics provides a detailed view. It also does not go as far back in time as stats.grok.se.
- WikiRoll: top viewed pages of the day/week/month/year on some Wikipedias, by Maciej Smoleński.
- WikiTrends: articles with biggest view increases (only Wikipedia)
- Wiki-Watch: Last 30 days, also works when stats.grok.se is down. For Wikipedia English. For Wikipedia German: de.wiki-watch.de/
- Raw data used for third party programs or analyzing (Henrik's source); see also User:Emijrp/Wikipedia_Archive#Domas_visits_logs