Wikipedia:Modelling Wikipedia's growth
The current rate of new articles initiated within the English Wikipedia is no longer growing exponentially. The new article rate seems to fit other models, such as a Gompertz function which predicts perhaps 4.4 million articles, a logistic function that can be projected to reach perhaps 3.5 million articles, or perhaps an extended-growth model, which predicts a much larger ultimate size. As of July 22, 2014 there are 4,562,745 articles.
- 1 Growth of the article count
- 1.1 Old exponential model for article count of Wikipedia
- 1.2 Critique of the exponential model
- 1.3 Logistic model for growth in article count of Wikipedia
- 1.4 Critique of the logistic model
- 1.5 Quadratic model for article count of Wikipedia
- 1.6 Extended-growth model
- 1.7 Two-phase exponential model
- 1.8 Gompertz model (2010–)
- 2 Data set for number of articles
- 3 Other measurements of article growth
- 4 Growth of Wikipedia network
- 5 See also
- 6 References
- 7 External links
Growth of the article count
The following graph shows the number of articles on the English Wikipedia from its creation in 2001 up to the present.
Here, several models are presented to attempt to explain the observed general trends in article growth.
Old exponential model for article count of Wikipedia
- Note: Between 2003 and 2006/2007 this was the general model for article count of Wikipedia.
Graphs of the article count for the English Wikipedia, from January 10, 2001, to September 9, 2007, based on statistics from this page and Wikipedia:Announcements. The two graphs show both logarithmic and linear y-axes. The graphs also show the approximate rate of article increase per day, along with the projected number of articles based on annual doubling referenced to January 1, 2003.
The growth in articles had been approximately 100% per year from 2003 through most of 2006, but has tailed off since roughly September 2006. The trend is no longer one of exponential growth, but has been closer to linear since that time.
A few notes on features of the graph:
- Growth was slow at the start of the project and gradually accelerated over time.
- The big slowdown in the rate of article creation in June–July 2002 was caused by major server performance problems, remedied by extensive work on the software.
- The sudden jump in article count in October 2002 is due to roughly 30,000 stub articles on U.S. towns and cities generated from a database being added by an auto-posting robot, Rambot, during an eight-day period. Although initially controversial as to whether these were "real" encyclopedia articles or merely "stubs", most of the Rambot articles have since been substantially expanded.
- Not counting the Rambot operation, the true maximum rate of article creation was in August 2006, when about 2400 net new articles were being added each day. From September 2006 through May 2007, the article count increased by an average of about 1670 articles per day.
- During the first half of May 2007, the article growth rate dropped below 1500 articles per day, the lowest rate since October 2005. The growth rate has since rebounded to about 2000 articles per day from late July through early September 2007.
Critique of the exponential model
- Note: This was developed from 2004 to 2006.
The exponential model of Wikipedia growth is based on the following:
- more content leads to more traffic
- which leads to more edits
- which generate more content
Moreover, the average rate of growth is assumed to be proportional to the size of the Wikipedia, as a consequence of which, the growth would be exponential.
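Written as an equation, the proportionality assumption leads directly to exponential growth. The symbols N (article count), k (proportionality constant) and N_0 (article count at t = 0) are introduced here for illustration only:

```latex
\frac{dN}{dt} = k\,N
\quad\Longrightarrow\quad
N(t) = N_0\,e^{k t},
\qquad
\ln N(t) = \ln N_0 + k\,t
```

The last form shows that the logarithm of the article count grows linearly in time, with slope k.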
The graph of article count on the right is plotted on a logarithmic scale, so exponential growth should manifest itself as linear behavior of the data. Between October 2002 and July 2006, the data do fit very well along the dotted line shown, while from July 2006 onwards there is a noticeable fall off from linear behaviour. Before October 2002, the behaviour is more complex.
The graph on the right below is a close-up of the data points that follow a linear trend; the best-fit line in red was computed using linear regression. The slope of this best-fit line gives the characteristic (e-folding) time of the exponential growth.
The fitted value means that, to a very good approximation, the number of articles doubled once every 346 days from October 2002 to October 2006. Had Wikipedia kept up this trend, as shown on the graph, the number of articles would have been 1,900,000 by December 2006, 2,800,000 by June 2007 and 4,000,000 by December 2007. Growth has slowed since then, however, and Wikipedia has apparently ceased growing exponentially.
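The link between the fitted slope and the quoted doubling time can be written out as follows. This is a sketch of the standard relation, not the original expression; the symbol \tau for the characteristic (e-folding) time is an assumption made here, and its value is back-calculated from the 346-day doubling time quoted above:

```latex
\ln N(t) = \ln N_0 + \frac{t}{\tau},
\qquad
T_{\text{double}} = \tau \ln 2,
\qquad
\tau = \frac{346\ \text{days}}{\ln 2} \approx 499\ \text{days}
```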
The graph on the right is an exponential growth projection made in July 2006. The number of articles on the English Wikipedia up to July 2006 is shown in red, and this is extrapolated in blue using an exponential function (approximately 38000*exp(0.0017t) articles, where t is the number of days since January 1, 2001).
By the end of 2006, when there were 1.5 million articles, the projection was already overestimating the growth by 10-15%, and the prediction of over 3 million articles by the end of 2007 is significantly more than the actual figure of about 2.1 million articles.
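The July 2006 projection can be evaluated with a few lines of Python. This is a minimal sketch that uses only the rounded coefficients quoted above (38000 and 0.0017); the function names are made up for illustration, and because the coefficients are rounded the output only approximates the figures in the text.

```python
import math
from datetime import date

# Rounded coefficients quoted in the text: N(t) ~ 38000 * exp(0.0017 * t),
# with t in days since January 1, 2001.
N0 = 38_000
RATE_PER_DAY = 0.0017
EPOCH = date(2001, 1, 1)


def projected_articles(day: date) -> float:
    """Article count projected by the July 2006 exponential fit."""
    t = (day - EPOCH).days
    return N0 * math.exp(RATE_PER_DAY * t)


def doubling_time_days() -> float:
    """Doubling time implied by the rounded rate constant."""
    return math.log(2) / RATE_PER_DAY


if __name__ == "__main__":
    for d in (date(2006, 12, 31), date(2007, 12, 31)):
        print(d.isoformat(), round(projected_articles(d)))
    print("doubling time (days):", round(doubling_time_days(), 1))
```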
It has been hypothesized that the growth rate of Wikipedia consists of a constant number of articles per day, submitted by "hard-core" wikipedians, with additional articles submitted by less enthusiastic wikipedians proportional to the current article count of Wikipedia. In this model the growth rate should be a linear function of the size of Wikipedia.
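This hypothesis corresponds to a first-order linear differential equation. As a sketch, with a (articles per day contributed by the "hard-core" group) and b (per-article contribution rate of the remaining editors) as symbols assumed here for illustration:

```latex
\frac{dN}{dt} = a + b\,N
\quad\Longrightarrow\quad
N(t) = \left(N_0 + \frac{a}{b}\right) e^{b t} - \frac{a}{b}
```

For b > 0 the exponential term dominates at large t, so the constant contribution matters mainly in the early phase.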
- Is this model even remotely valid?
- How long can exponential growth go on, or is this really just the early part of a logistic curve?
- What does this imply for server and traffic scaling?
Eventually there will probably be a point where the number of articles created each day begins to fall, due to a lack of things to write about. It is probable, however, that the amount of information in each article will then increase in lieu of an increase in the number of articles. Limitations of the (current) Wikipedia interface will cause a bottleneck of sorts, limiting the type (and, by extension, the amount) of growth to vertical monolingual growth patterns, as opposed to lateral cross-lingual ones.
Note that from the beginning of December 2005, only registered users can create new pages.
Logistic model for growth in article count of Wikipedia
- Note: This was developed in 2007.
If Wikipedia's growth followed the exponential growth model, the average rate of growth would be proportional to the size of Wikipedia. The annual growth rate would stay constant, as would the average time in which the number of articles doubles. As can be seen here and on the third graph, this is not the case; the percentage growth is steadily declining.
Wikipedia's growth may follow the logistic growth model more closely. This model is based on:
- more content leads to more traffic, which in turn leads to more new content
- however, more content also leads to less potential content, and hence less new content
- the limit is the combined expertise of the possible participants.
Some characteristics of this model are:
- there will be a maximum to the number of articles. On Wikipedia one can hardly imagine this as there will be new events and people to describe in the future. Compared to the large number of existing articles this is a very small effect though.
- at the end the growth is zero.
- at the pivot point (halfway to the maximum) the growth is at its peak. For the en.wikipedia this might have been in August 2006 with 60,000 new articles a month.
This model is related to the quantity (number of articles). The quality might still increase independently.
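The characteristics listed above follow from the standard logistic equation dN/dt = r*N*(1 - N/K). The sketch below uses hypothetical values for the maximum K, the rate r and the midpoint time (none of them fitted to Wikipedia data) and checks numerically that growth peaks at half the maximum and vanishes near the limit.

```python
import math

# Hypothetical parameters for illustration only (not fitted to real data).
K = 3_500_000      # assumed maximum number of articles
R = 0.8            # assumed intrinsic growth rate per year
T_MID = 6.0        # assumed time (in years) at which N = K / 2


def logistic(t: float) -> float:
    """Article count N(t) under the logistic model."""
    return K / (1.0 + math.exp(-R * (t - T_MID)))


def growth_rate(t: float) -> float:
    """dN/dt = R * N * (1 - N / K), in articles per year."""
    n = logistic(t)
    return R * n * (1.0 - n / K)


if __name__ == "__main__":
    times = [i / 10 for i in range(0, 201)]   # 0 to 20 years in 0.1-year steps
    t_peak = max(times, key=growth_rate)
    print("peak growth at t =", t_peak)
    print("N at peak / K =", logistic(t_peak) / K)
    print("growth rate at t = 20:", growth_rate(20.0))
```

In this form the peak growth rate equals R*K/4, so a measured peak rate and a fitted R together constrain the plateau K.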
A best fit of the logistic model to the statistical data available by the end of 2008 suggested that the growth limit to the number of articles, where on average the creation and deletion of articles are in balance, will be between 3 million and 3.5 million articles, with the 3 million point being reached around March 1, 2010.
However, by July 2009 it was clear that 3 million would be achieved by the end of August 2009 and the plateau would likely be close to 3.5 million articles.
Critique of the logistic model
- The model seems to imply that the number of articles that the Wikipedia will have is fixed.
- The rate of creation in the Wikipedia is unlikely to eventually fall to zero as new articles on new topics that arise due to new events and discoveries will still be required. As of June 2010, about half of the new articles created that weren't immediately deleted seem to be articles that couldn't have been created before 2001 when the Wikipedia started. This would seem to imply that 'logistic growth plus linear' might be a better model in the long run.
Quadratic model for article count of Wikipedia
- Note: At the end of 2008, WP:Size of Wikipedia#Annual growth rate used a simple model with a reducing rate of new articles to predict when growth would come to an end.
|Date||Article Count||Increase during Preceding Year||% Increase during Preceding Year||Average Increase per Day during Preceding Year|
|NOTE: January 2018 is projected from 2009/2008/2007 (adding 60,000 fewer articles each year). Final article count plateau is: 2,679k + (470 + 410 + 350 + 290 + 230 + 170 + 110 + 50)k = ~4,759,000 articles (deleted/merged articles will balance the number of added articles). Assumes the same attitudes about notability, merging and lists.|
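The arithmetic in the note can be reproduced with a short loop. This is a sketch of the stated rule only (a yearly increase that starts at 470,000 and shrinks by 60,000 each year, added to a base of 2,679,000); the function and parameter names are made up for illustration. Because the yearly increase falls linearly, the cumulative count follows a quadratic curve, which is consistent with the model's name.

```python
def plateau_projection(base: int, first_increase: int, yearly_drop: int):
    """Add linearly shrinking yearly increases until they would drop to zero or below.

    Returns the list of yearly increases and the final article count.
    """
    increases = []
    increase = first_increase
    while increase > 0:
        increases.append(increase)
        increase -= yearly_drop
    return increases, base + sum(increases)


if __name__ == "__main__":
    increases, plateau = plateau_projection(2_679_000, 470_000, 60_000)
    print(increases)
    print(plateau)
```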
Extended-growth model
In 2009, the continued strong growth indicated there was no obvious nearby mid-point in the growth for new articles. Although growth was slowing, it was slowing more gradually, and could be expected to continue beyond another 15 years, creating up to 10 million articles. The predicted date for the 3-million-article mark would be much earlier, in mid-August 2009. The growth was supported by the need for various spin-off articles, such as unseen-hand and lost-world articles, millions of missing red-link articles, plus many thousands of new disambiguation pages needed to connect the other millions of pages. The new projected mid-point might occur in 2011, although any massive auto-upload of numerous articles could change the schedule, such as a mass, automated effort to auto-generate red-link stubs with sources suggested from search-engine results. The continued strong growth fits the model reaching about 10 million articles, before deletions and merges would offset the increase of new articles being added.
Two-phase exponential model
The growth rate N'(t) of Wikipedia (number of new articles per unit of time) can be accurately modeled by two exponentials, one increasing ("phase 1") and one decreasing ("phase 2"), with a fairly sharp crossover around January 2006. In the following plots, the dots are the observed counts N(t) (cleaned and resampled at equal 28-day "months") and the respective increments N'(t) (new articles per 28-day month). The solid lines are the values of N'(t) and N(t) computed by the model.
Plots (not shown): growth rate N'(t), linear and log scale; article count N(t), linear and log scale.
Seasonal modulation since 2006
Since 2006, there has also been a strong semiannual variation in the new article rate, with peaks in February and August. The following plots include this modulating factor:
Plots (not shown): growth rate N'(t) with seasonal modulation, linear and log scale; article count N(t) with seasonal modulation, linear and log scale.
Some implications of this model:
- The slowdown is not a "natural" phenomenon but rather the consequence of some change in Wikipedia policy and/or tools.
- The "fertility" of Wikipedia's corps of editors (their output of new articles) is shrinking.
- Wikipedia will nearly stop growing before reaching 6 million articles.
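The last implication follows from the shape of the model: a growth rate that decays exponentially has a finite integral, so the article count approaches a limit. The sketch below illustrates this with a rising exponential joined continuously to a decaying one at the crossover. Every parameter value is hypothetical and chosen only for illustration (the fitted values are not reproduced here), so the limit it computes is not the model's actual prediction, and the seasonal modulation is left out.

```python
import math

# All values below are hypothetical, for illustration only.
T_CROSS = 1826.0     # crossover in days since January 1, 2001 (about January 2006)
PEAK_RATE = 2000.0   # assumed growth rate at the crossover, articles per day
A_UP = 0.0017        # assumed phase-1 rise constant, per day
B_DOWN = 0.0005      # assumed phase-2 decay constant, per day


def growth_rate(t: float) -> float:
    """N'(t): rising exponential before the crossover, decaying one after."""
    if t <= T_CROSS:
        return PEAK_RATE * math.exp(A_UP * (t - T_CROSS))
    return PEAK_RATE * math.exp(-B_DOWN * (t - T_CROSS))


def article_count(t: float) -> float:
    """N(t): closed-form integral of growth_rate from day 0 to day t."""
    if t <= T_CROSS:
        return PEAK_RATE / A_UP * (
            math.exp(A_UP * (t - T_CROSS)) - math.exp(-A_UP * T_CROSS)
        )
    phase1 = PEAK_RATE / A_UP * (1.0 - math.exp(-A_UP * T_CROSS))
    phase2 = PEAK_RATE / B_DOWN * (1.0 - math.exp(-B_DOWN * (t - T_CROSS)))
    return phase1 + phase2


def final_count() -> float:
    """Limit of N(t) as t grows without bound."""
    phase1 = PEAK_RATE / A_UP * (1.0 - math.exp(-A_UP * T_CROSS))
    return phase1 + PEAK_RATE / B_DOWN


if __name__ == "__main__":
    for days in (1000, 1826, 3000, 6000, 20000):
        print(days, round(growth_rate(days)), round(article_count(days)))
    print("limit:", round(final_count()))
```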
Here is the text file with the data used to generate these plots. The first column is the time t, specifically the elapsed days since January 1, 2001. Columns 2, 3 and 4 are the year, month and day. Column 5 is the observed article count N(t) on that date (cleaned and resampled). Column 7 is the value of N(t) predicted by the model. Columns 9 and 11 are the observed and predicted growth rates N'(t) in articles per "lunar" month (28 days). There is also a technical report describing the model and the data set.
Gompertz model (2010–)
This model is based on the Gompertz function. The Gompertz function is like a logistic function, but the future value asymptote of the function is approached much more gradually, in contrast to the logistic function in which both asymptotes are approached by the curve symmetrically.
The reasons for this new model are
- The growth rate function does not seem to be time-symmetrical, unlike the logistic function
- The percentage of article growth per month in the logarithmic graphs seems to be linear (graphs (1) and (2)), as it is for the Gompertz function
The formula for the Gompertz function for the en.wikipedia is $N(t) = a\,e^{b\,e^{c t}}$, with
- a= 4378449 (the predicted maximum for about 4.4 million articles)
- b= -15.42677
- c= -0.384124
- t is the time in years since 1/1/2000 (so 1/1/2010 is t=10.00)
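With these parameters the curve can be evaluated directly. The sketch below assumes the standard three-parameter Gompertz form N(t) = a*exp(b*exp(c*t)), with a as the upper asymptote; the function name is made up for illustration.

```python
import math

# Parameters quoted in the text for the English Wikipedia.
A = 4_378_449      # predicted maximum number of articles
B = -15.42677
C = -0.384124


def gompertz_articles(t: float) -> float:
    """Article count at time t, in years since 1/1/2000 (1/1/2010 is t = 10.0)."""
    return A * math.exp(B * math.exp(C * t))


if __name__ == "__main__":
    for t in (5.0, 10.0, 15.0, 30.0):
        print(t, round(gompertz_articles(t)))
```

For t = 10 (1 January 2010) this evaluates to roughly 3.1 million articles, and for large t it approaches a from below.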
The expected maximum of the Gompertz model lies between that of the logistic model and that of the extended-growth model.
See below three Gompertz model graphs, followed by three corresponding graphs of the logistic model, a graph for a general comparison between the logistic, Gompertz and extended-growth models, and a graph of the top 20 Wikipedias, which in general show the same behavior in percentage of article growth.
A possible refinement
A small but significant disparity has started to develop between the measured article count and the fitted Gompertz curve, with the article count rising faster than predicted since mid-2011. This has gone on for long enough now that I think it's likely that it's not a glitch.
One possible model, based on visual inspection of File:EnwikipediapercgrowthGom.PNG, might be a Gompertz curve with a small additional constant exponential growth term, $N(t) = a\,e^{b\,e^{c t} + d t}$, which would have the property that the small $d t$ term would be "uncovered" only in the latter stages of the Gompertz growth curve, because it would be dominated by the $b\,e^{c t}$ term prior to that point.
Applying this to the data at Wikipedia:Size_of_Wikipedia#The_data_set, using a bit of numerical optimization to find the parameters, gives a much better fit to at least the most recent parts of the data. With the extra parameter, however, it gets much easier to fit any curve, and there's a danger of overfitting. It also fits less well at the start, before the beginning of the fitting window in 2004.5 (chosen to remove the wild growth fluctuations of the Rambot-era data). But it adds some plausibility to the model, and at the very least provides a plausible-seeming new ad-hoc extrapolation that can be compared against the other candidates in the future.
Here are the corresponding percentage interval-to-interval changes, using the data series resampled into 0.05 year intervals, with a log scale on the y-axis, showing the closeness of fit from 2005 onwards:
Here are the corresponding results for dewiki, which didn't have the initial 2002-era server slowdown/Rambot perturbations found in the enwiki data:
Data set for number of articles
As Erik Zachte's statistics for the English-language Wikipedia have not been updated since October 2006, these are the figures I (HenkvD) use for generating the graphs. The data up to October 2006 were taken from one of Erik's Downloads. Since then I have taken the data manually each month on the same date (or a day later) from the Special:Statistics page. See also Wikipedia:Size of Wikipedia#The data set for a list of values of the official count, recorded manually at irregular intervals.
Date, Number of articles
Other measurements of article growth
Edits per article
The following graph shows the mean number of edits per article, and is intended as a measure of the quality of the articles, assuming that editing improves the content.
The graph is plotted in logarithmic scale, and this data also fits well with exponential growth starting from October 2002. The number of edits per article has since doubled once every 505 days, a rate consistent with Moore's law.
Modelling growth of Wikipedia page views per million
Using the Alexa page views per million data from Wikipedia:Awareness statistics in the period 1 January 2003 to 5 September 2005, filtering out all points less than 28 days away from the previous point (to avoid excessive weighting during time periods where points are densely sampled), and performing a linear least-squares fit of the logarithm of the data, gives the following approximate formula:
- log_e(page_views_per_million) = -50 + 5e-08 * unix_epoch_of_date
for n = 21 points fitted
This implies a doubling period of (log_e(2) / 5e-08) / 86400 days, which is approximately 160 days, and an annual growth factor in page views per million of exp(5e-08*365*86400), which is approximately 5.
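Both figures can be checked numerically from the fitted slope. A minimal sketch, using only the slope of 5e-08 per second quoted above; the function names are made up for illustration:

```python
import math

SLOPE_PER_SECOND = 5e-08   # fitted slope of log_e(page views per million) vs. Unix time
SECONDS_PER_DAY = 86_400


def doubling_period_days(slope: float = SLOPE_PER_SECOND) -> float:
    """Days needed for page views per million to double."""
    return math.log(2) / slope / SECONDS_PER_DAY


def annual_growth_factor(slope: float = SLOPE_PER_SECOND) -> float:
    """Multiplicative growth over 365 days."""
    return math.exp(slope * 365 * SECONDS_PER_DAY)


if __name__ == "__main__":
    print(f"doubling period: {doubling_period_days():.1f} days")
    print(f"annual growth factor: {annual_growth_factor():.2f}")
```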
Playing around with different time periods and filter times gives a range of results, from which we can reasonably say that Wikipedia's estimated page views per million doubling time is somewhere in the range 130 - 160 days. The recent (2005) doubling time of 156 days or so is within the range of the longest-term doubling time of about 155 - 159 days, with the 2004 period being the exception to the long-term and short-term trends.
Modelling improvement in Wikipedia's Alexa traffic rank
Applying a similar linear regression fit to the log of Wikipedia's Alexa traffic rank from October 2002 to September 2005 gives a similar result, with a halving period (lower is better for rank) of roughly 134 - 138 days over the long term, with a 2005-data-only halving time of 114 days! Since the page rank as of September 2005 was roughly 40, this suggested, if taken to logical extremes, using the most cautious of the three figures and rounding it to 4.5 months, that Wikipedia would reach:
- page rank 20 in 4.5 months
- page rank 10 in 9 months
- page rank 5 in 13.5 months
- be fighting its way into the top 3 in 18 months, and
- be fighting its way to the #1 spot in 22.5 months...
So, clearly this exponential growth had got to stop or slow down, or it was going to be a wild ride...
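The list above is a straightforward halving extrapolation, which can be sketched as follows. The starting rank of 40 and the 4.5-month halving period are taken from the text; the function name is made up for illustration.

```python
START_RANK = 40.0        # approximate Alexa rank in September 2005 (from the text)
HALVING_MONTHS = 4.5     # rounded halving period used in the text


def projected_rank(months: float) -> float:
    """Rank after `months`, assuming it keeps halving every HALVING_MONTHS."""
    return START_RANK * 0.5 ** (months / HALVING_MONTHS)


if __name__ == "__main__":
    for m in (4.5, 9.0, 13.5, 18.0, 22.5):
        print(m, projected_rank(m))
```

After 18 and 22.5 months the extrapolated ranks are 2.5 and 1.25, which is what "fighting its way into the top 3" and "to the #1 spot" refer to.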
November 2005 - the daily page rank is averaging 34, and reached 31 in October.
January 2006 - the daily page rank has averaged 20 for about a week, in line with the original predictions above.
April 2006 - averaging 16/17 this month, although in March it reached as high as rank 12, the then record.
July 2006 - deviating from predictions; Wikipedia was supposed to have reached rank 10 by now, yet for the whole of June we hovered between 16 and 18.
September 2006 - Heavily deviating from predictions; by the end of October, Wikipedia was supposed to reach rank 5, yet it is still only making small gains, hovering between 14 and 16 now. The climb up the rankings has slowed down, but for now we are still climbing! Wikipedia has broken the "50,000 reach" barrier, meaning we reach as many people as youtube.com and even more than myspace.com!
November 2006 - Alexa weekly rank now 12 and still climbing, with occasional daily blips up to 11. Wikipedia once made the daily top 10, on the 12th!
February 2007 - 18 months after the predictions, I think it's safe to say the model is flawed. We should be ranked 3rd, but the high is 8, with the average being 10/11. We're still gaining popularity, just not as fast as expected.
May 2008 - Swaying between 7 and 8 for the past few months with 8 being slightly more common. The rise, though slow, continues.
December 2008 - The traffic rank continues to be around 8. No clear trend is evident in the rank, but the number of daily pageviews displays a steady decline since June 2008.
March 2009 - The traffic rank has been consistently 7 for more than 6 weeks now, and has not been below 8 for three months. The half-year graph suggests a transition period from October to February for the move from rank 8 to 7. Pageviews have slightly recovered, again reaching July 2008 levels, though still far from those of June 2008.
June 2009 - Fairly consistently 7, with only intermittent falls to 8. Pageviews are fairly steady at around 0.5% of global, with a very slight upward trend evident.
September 2009 - Spending more time at 6, with intermittent returns to 7. Pageviews are about 0.55-0.6% of global with an upward trend still evident.
November 2009 - Mostly at 6, with occasional returns to 7. Pageviews are level at about 0.53-0.6% of global.
April 2011 - at 8. However, ComScore results as of January 2010 put all Wikimedia properties collectively at 5: see http://meta.wikimedia.org/wiki/User:Stu/comScore_data_on_Wikimedia
November 11 2012 - Back to 6, with 13% reach. For comparison, Google at position 2 worldwide has about four times the reach at 46%
December 29 2013 - Global rank 6, U.S.-only rank 7.
Growth of Wikipedia network
In the context of complex network theory there have been a number of efforts to model the growth of the Wikipedia network, in which the nodes represent the articles and the links are the hyperlinks between articles. Models of this type are based on simple local probabilistic rules which should reproduce the different distributions of Wikipedia's statistical variables. Analyses show that the distribution of the number of hyperlinks pointing to a given article has a very stable power-law exponent for a number of Wikipedias in different languages. It was also confirmed that the reciprocity (the ratio of the number of hyperlinks connecting two articles in both directions to the total number of hyperlinks) is very stable across a number of different Wikipedias.
See also
- Law of accelerating returns
- User:Dragons flight/Log analysis
- Wikipedia:Academic studies of Wikipedia
- Wikipedia:Inflationary hypothesis of Wikipedia growth
- Wikipedia:Modelling Wikipedia's growth/Previous traffic predictions
References
- Data from en:Wikipedia:Database download
- Zlatić, Vinko; Štefančić, Hrvoje (2009), Model of Wikipedia growth based on information exchange via reciprocal arcs, arXiv:0902.3548
- Capocci, A.; Servedio, V. D.; Colaiori, F.; Buriol, L. S.; Donato, D.; Leonardi, S.; Caldarelli, G. (2006), "Preferential attachment in the growth of social networks: the internet encyclopedia Wikipedia", Physical Review E 74 (3): 036116, doi:10.1103/PhysRevE.74.036116
External links
- First-Order Nonhomogeneous Linear Differential Equations (Broken link)
- Suh, B., Convertino, G., Chi, E. H., & Pirolli, P. (2009) The singularity is not near: slowing growth of Wikipedia. WikiSym, (pp. 8:1-10).
- Aniket Kittur, Ed H. Chi, Bryan A. Pendleton, Bongwon Suh, Todd Mytkowicz. Power of the Few vs. Wisdom of the Crowd: Wikipedia and the Rise of the Bourgeoisie. Presented at alt.CHI at ACM SIGCHI Conference 2007. April, 2007. San Jose, CA., a paper examining what user groups are behind most of the edits in Wikipedia.