User:Ixfd64/revision sizes
From Wikipedia, the free encyclopedia
This is an R script for visualizing the byte sizes of Wikipedia pages over time. I originally wrote this for an assignment for my Stat 133 class, but I have decided to release it to the public. Other Wikipedians are welcome to improve it as they see fit.
[edit] Source code
# current version: 1.2.0 # last update: November 19, 2009 # load the Tcl/Tk library require(tcltk) # create main GUI base = tktoplevel() tkwm.title(base,'Wikipedia revision sizes') # create main frame nfrm = tkframe(base) # create Tcl variables revisions.tcl = tclVar('') page.tcl = tclVar('') # create "Page" text field f1 = tkframe(nfrm) tkpack(tklabel(f1, text = 'Page'), side = 'left') tkpack(tkentry(f1, width = 25, textvariable = page.tcl), side = 'left') # create "Revisions" text field f2 = tkframe(nfrm) tkpack(tklabel(f2, text = 'Revisions'), side = 'left') tkpack(tkentry(f2, width = 8, textvariable = revisions.tcl), side = 'left') # pack GUI elements tkpack(f1, side='top') tkpack(f2, side='top') tkpack(nfrm) # language code # complete list available at http://meta.wikimedia.org/wiki/List_of_Wikipedias lang = 'en' # get page revision sizes getpage = function(...) { revisions = as.numeric(tclvalue(revisions.tcl)) page = as.character(tclvalue(page.tcl)) page.loc = gsub(' ', '_', page) # fix space parsing ge = function(s, g) substring(s, g, g + attr(g, 'match.length') - 1) byte.pat = '<span class="history-size">(\\([0-9bytesmtp ,]+\\))</span>' historylink = paste('http://', lang, '.wikipedia.org/w/index.php?title=', page.loc, '&limit=', as.character(revisions), '&action=history', sep='') history = readLines(historylink) if (any(grep('no revision history', history))) { tkmessageBox(title = 'Error', message = 'This page does not exist!', type = 'ok') } else { revisiondata = gsub(byte.pat, '\\1', mapply(ge, history[grep(byte.pat, history)], gregexpr(byte.pat, history[grep(byte.pat, history)], ignore.case = TRUE)), ignore.case = TRUE) revisiondata = as.character(revisiondata) revisiondata = gsub('[\\(\\), ]', '', revisiondata) for (a in 1:length(revisiondata)) { if (revisiondata[a] == 'empty') {revisiondata[a] = '0'} } revisiondata = gsub('[bytes]', '', revisiondata) revisiondata = as.numeric(revisiondata, ignore.na = TRUE) revisiondata = rev(revisiondata) plot(revisiondata, xlab = 'Revision index', ylab = paste('Revision size of \"', page, '\" (bytes)', sep = ''), type = 'l') } } # quit program destroy = function(...) tkdestroy(base) # create buttons bfrm = tkframe(base) tkpack(tkbutton(bfrm, text = 'Run', command = getpage), side = 'left') tkpack(tkbutton(bfrm, text = 'Quit', command = destroy), side = 'right') # pack bottom frame tkpack(bfrm, side = 'bottom')
[edit] Usage
The last 500 revisions of Milky Way
This script requires R to be installed. The best way to use this script is to put it in a text file located in the directory of your R workspace. For example, if you named your file wpgraph.txt, you can run the script by entering source('wpgraph.txt') in the R console.
[edit] Limitations
- This script does not work for edits made before April 19, 2007. Such revisions do not indicate the byte size due to an older version of MediaWiki being used at the time. Since this script uses regular expressions to determine the byte sizes, it will not work for revisions that do not have the size indicator.
- Certain characters are not processed correctly. For example, "1+1" must be entered as 1%2B1.