Wikipedia:Modelling Wikipedia extended growth
|This essay contains the advice or opinions of one or more Wikipedia contributors. Essays are not Wikipedia policies or guidelines. Some essays represent widespread norms; others only represent minority viewpoints.|
|This page in a nutshell: This essay describes issues in projecting the long-term growth of Wikipedia beyond year 2025, with up to 10 million articles, including growth from resolved red-link articles, spinoffs, disambiguation pages, and lost-world, fan-cruft or unseen-hand articles. Predicted date of 3 million articles: mid-August 2009 (actual: 17 August 2009).|
This essay, Wikipedia:Modelling Wikipedia extended growth, considers many issues about projecting the long-term growth of Wikipedia, including growth promoted by many other types of articles, beyond the traditional encyclopedia and major pop-culture articles. The extended-growth model considers factors that will create millions of new articles, far beyond the current 4.7 million articles (live count), to reach perhaps 9 to 10 million articles, before deletions offset the creations of new articles.
The growth of Wikipedia, although somewhat reduced, is neither slowing as much as predicted in 2007 nor skyrocketing as predicted in 2005. The extended model predicted that Wikipedia would exceed a total of 3 million articles in mid-August 2009, rather than at year-end (it occurred on 17 August 2009). The model predicted that the 3.5 millionth article would be added in mid-September 2010, but that milestone occurred on 12 December 2010 instead.
|Wikipedia size & users|
|Articles per day:||+2889|
|Total wiki pages:||33,823,661|
|UTC time: 11:38 on 2014-Sep-21|
Wikipedia extended growth model
An alternative possibility for the growth of Wikipedia is a more protracted, long-term decline in new articles: neither the original exponential burst that doubled each year, nor a balanced bell curve that peaked in late 2006. Instead, an extended-growth model should be considered in which the midpoint, or mean size, occurs during 2010–2011 at 4 or 5 million articles, and that size then doubles to nearly 9–10 million articles in the long term. The additional millions of articles will be various types of follow-on articles, created after the major articles have mostly become stable.
The psychological motivation for the follow-on articles might be a feeling that Wikipedia needs to answer some basic questions about any notable topic that anyone can think about. That motivation is probably much stronger than refining the existing articles to be a comprehensive treatment of each topic (see below: Psychological motivation).
Graphical model fits overall pattern
The model was developed as a graphical curve to fit the overall pattern of the data, which does not follow a simple mathematical model because many batches of new articles are added by bots and short-term groups, rather than as "random" additions by the general public. Hence, no simple equation could fit the actual data, which fluctuates wildly when bot programs are triggered to load numerous new articles in some months, such as for numerous protein-sequence articles. There is no simple mathematical "process generator" to simulate new-article growth. A detailed operational model would not be an equation, but rather a logical, procedural computerized model.
However, the growth impact of articles from the general public has been much greater than that of the short-term group efforts, so the overall pattern appears to be a somewhat linear decline in the growth rate for new articles when averaged over several months (or projected about 3 years ahead) at a time. Perhaps a rough equation would reduce the new-article growth by 11% each year, with the understanding that the decline slows further each June/July but the rate rises each August, probably tied to school vacations in the Northern Hemisphere. Bear in mind that if a massive bot were triggered to load 700,000 new articles from a "Who's Who in Science", then the new-article rate would soar for months, and would always appear as an anomaly, an upward bump, in the declining overall curve during the next 20–65 years.
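The rough equation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the starting rate of 1,000 articles per day is an illustrative round number, the function and parameter names are hypothetical, the seasonal June/July and August effects are ignored, and the bulk load is simply added to one year's total.

```python
def yearly_new_articles(base_daily, years, annual_decline=0.11, bulk_loads=None):
    """Return projected new-article totals, one entry per year.

    base_daily     -- average new articles per day in year 0 (illustrative input)
    annual_decline -- fraction by which the daily rate falls each year (11% here)
    bulk_loads     -- optional {year_index: article_count} for one-off bot uploads
    """
    bulk_loads = bulk_loads or {}
    totals = []
    for year in range(years):
        # Geometric decline: each year keeps (1 - annual_decline) of the prior rate.
        daily = base_daily * (1.0 - annual_decline) ** year
        totals.append(daily * 365 + bulk_loads.get(year, 0))
    return totals


# Assumed inputs: 1,000 articles/day at the start, and a hypothetical
# 700,000-article bot upload landing in year 2.
baseline = yearly_new_articles(1000, 5)
with_bot = yearly_new_articles(1000, 5, bulk_loads={2: 700_000})

for year, (plain, bumped) in enumerate(zip(baseline, with_bot)):
    print(year, round(plain), round(bumped))
```

In the printed comparison, the bot upload appears as a single-year bump on an otherwise steadily declining series, which is the anomaly the paragraph describes.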
Continued growth for follow-on articles
The initial base of Wikipedia articles covered the traditional encyclopedia and mainstream pop-culture articles, including historical figures, world events, catalogs of scientific terms, celebrities, entertainment topics, and famous sports figures. Those topics, after 6 years of expansion, were thought to be saturated, so the primary growth of Wikipedia was expected to decline quickly and end within 5 years.
However, growth can be expected for several other types of articles:
- unresolved redlink articles – linked because authors expected notability (someday);
- spinoffs – sets of sub-articles created when large articles are split;
- disambiguation pages – whenever 2 or more articles have similar titles, expect a page to separate them;
- unseen-hand articles – these are the supporting cast & crew, or assistant leaders, as the power behind the throne that made things happen;
- lost-world articles – these are the long-lost, buried civilizations, failed inventions, secret societies, or forgotten heroes;
- also-ran articles – these are the contenders, or losing players, just outside the limelight;
- technical artifacts – such as cars, consumer electronics, electrical parts, scientific instruments, software, weapons. Thousands (perhaps millions) of new models enter the market each year, and millions were notable from the past (e.g. IBM 1620).
- chemicals – it is estimated that some 10 million substances like 2,2-dimethylbutane have been described non-trivially in the literature (with some other information besides their mere existence and formula).
- species – estimates for the number of species range in the millions, all with some nontrivial information published somewhere.
- stars – there are several star catalogues with millions of stars listed, of which only a fraction is listed on Wikipedia as yet.
- fan-cruft articles – these are detailed or pop-culture topics, such as one-event clothing designs, that get mentioned (briefly) in mainstream news.
Note that even the fan-cruft articles will be notable, because thousands or millions of people might be affected, however briefly, and the topic will be covered by some mainstream media sources.
- additional articles for new things of established sorts: new books, new films, newly notable performers, newly elected politicians, new major athletes, new scientific discoveries, new major products. Where major prizes are notable, there will be new people in that group every year. This portion can never become saturated, though the growth can become linear.
- expansion of the English Wikipedia into fuller coverage of other cultural areas; for example, coverage is much more saturated for UK railroad stations than for ones in India.
Because of the large array of follow-on articles, there seems to be great potential for creating masses of new articles, beyond the millions of traditional encyclopedia and major pop-culture articles.
Annual growth rate of new articles
The table below shows the increasing article counts for the English Wikipedia:
|Date||Article count||Increase during preceding year||% Increase during preceding year||Average increase per day during preceding year|
If Wikipedia's growth were nearing an end, then most major redlinks in articles would already be resolved with the intended linked articles. However, many articles still contain 6 or more redlinks. Improbable redlinks are often removed from articles, so the remaining redlinks typically point to notable topics. They include: nearby mountain names, wildlife reserves, rivers, bays, towns, key personnel, book/film titles, special varieties, etc. Such topics are easily defended as being notable, so the redlinks are a major influence on creating new notable articles.
Articles as disambiguation pages
A common type of new article is a disambiguation page, which offers a choice of articles related to the same title. Originally, the choice was between items having exactly the same name, such as "John Smith" or "Mary Jones" or "Leonardo". However, variations of a title were added as potential matches, in a manner similar to word-prefix searches. As a result, disambiguation pages began listing organized groups of potential matches for a partial title, carefully grouping people, companies, towns, films (etc.) with a short description of each.
A disambiguation page can be so comprehensive and descriptive that it acts like search-engine results "on steroids": a structured, informative scan that would be a lofty goal for a search engine to attain. Because of the exceptional information distilled by the disambiguation pages, they can be valuable additions to Wikipedia, and hence a major source of welcome new pages. In February 2009, Wikipedia had nearly 108,000 disambiguation pages, more than the entire size of Wikipedia back in early 2003. In early 2009, disambiguation pages made up perhaps 1–2% of the daily growth of new articles. By 2014 the count of disambiguation pages had grown to over 250,000.
Articles as lost-world topics
The search for knowledge often illuminates the worlds of yesteryear. Archaeologists have excavated for decades at Emperor Qin's Terracotta Army in Xian (China), at the fields of Ephesus, the hills of Copan, the ruins at Carchemish, inside many Caribbean shipwrecks, under lava flows near Pompeii, and in ancient temples at Edfu, Abydos or Kom Ombo along the Nile. As new discoveries are pieced together, thousands of ancient topics gain the details to become full articles.
The world of antiques, with furniture and household items, instantly provides many thousands of topics for new articles.
Paleontologists are expanding the fossil record in many areas: as the arctic glaciers melt, numerous fossils are sometimes found on the newly exposed surface; and even in Africa, where dinosaur remains were once rarely seen, numerous fossils are being discovered.
Many thousands of articles can be expected on lost-world topics.
Articles as unseen-hand issues
Behind, or beneath, the major, popular topics are the "unseen hand" articles. The supporting cast and crew (sometimes with a "cast of thousands") eventually become known well enough to fill new articles.
Psychological motivation for new articles
The English Wikipedia, since early 2005, has added over 1,000 new articles every day. However, only a few articles a day are refined and polished to meet featured-article status. A ratio of only a few featured articles per thousand new articles suggests that some key psychological factors are involved.
The psychological motivation for creating so many new, follow-on articles might be a feeling that Wikipedia needs to answer some basic questions about almost any notable topic that anyone can imagine. For example, there are over 33,000 English Wikipedia articles about professional footballers (soccer players), and many of those articles are read daily, by someone somewhere. In contrast, for the more traditional field of mathematics, there are perhaps 21,000 total articles. However, new articles are still being added.
Meanwhile, the process of refining articles to reach featured-article status, as a comparison, involves weeks of changes and reviews. In addition, the criteria used to screen articles can become severe: some reviewers even request that the phrasing within an article be made more diverse, by eliminating repetition of ordinary phrases. It is not enough to just describe all major aspects of a topic; those articles must also meet certain literary standards. During 2008, over 100 articles lost their featured-article status, as criteria became perhaps more strict about the quality required for featured level.
As a consequence, the motivation is probably much stronger to create new (brief) articles that provide a general introduction to each topic than to refine or polish the existing articles into comprehensive treatments of their topics, according to a carefully defined set of high-quality criteria.
Growth as a percentage of prior year
Although the decline in daily growth has been observed for only about 6 years, it is possible that the number of new daily articles declines by about 9% each year, so that each year adds only 91% of the prior year's new-article count. Under that form of model, the total number of articles would continue to grow, even beyond year 2040, before the added articles became offset by daily deleted articles.
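A percentage-of-prior-year model is a geometric series, so its cumulative total and its long-run ceiling can be sketched directly. The following is a minimal illustration, not the essay's actual model: the function names are hypothetical, the starting value of 1,437 articles per day in 2008 is borrowed from the essay's own table, and the deletion threshold is left as a purely hypothetical parameter because no deletion rate is stated here.

```python
def cumulative_added(daily_rate, retention, years):
    """Total articles added over `years` years when each year's daily rate
    is `retention` times the prior year's (e.g. 0.91 for a 9% decline)."""
    total = 0.0
    for _ in range(years):
        total += daily_rate * 365
        daily_rate *= retention
    return total


def limiting_total(daily_rate, retention):
    """Ceiling of the geometric series: additions can never exceed this,
    however many years pass (deletions are ignored)."""
    return daily_rate * 365 / (1.0 - retention)


def first_year_below(daily_rate, retention, threshold, start_year):
    """First year in which the daily rate drops below `threshold`,
    a hypothetical stand-in for the daily deletion rate."""
    if not (0 < retention < 1) or threshold <= 0:
        raise ValueError("need 0 < retention < 1 and threshold > 0")
    year = start_year
    while daily_rate >= threshold:
        daily_rate *= retention
        year += 1
    return year


# Assumed inputs: 1,437 new articles/day in 2008, 91% retention per year.
print(round(cumulative_added(1437, 0.91, 32)))   # additions over 2008-2039
print(round(limiting_total(1437, 0.91)))         # long-run ceiling
```

Because the series converges, the model implies a finite number of further articles rather than unbounded growth; the year in which creations are offset by deletions depends entirely on the threshold supplied to `first_year_below`.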
The following table shows the daily new-article count at 7-year intervals, declining by about 17% annually (a steeper rate than the 9% considered above):
2008 – 1437
2015 – 428
2022 – 116
2029 – 31
2036 – 9
Beyond 2008, the daily new-article count (declining by 17% each year) is only an approximation: the purpose of the table is to show how article growth could easily continue past year 2040. The actual new-article counts are likely to differ greatly from the table values. Note that the actual counts could jump much higher, especially if bot programs are someday written to auto-generate stub articles for redlinks, for example by auto-searching for matching source webpages, then auto-generating footnotes and inserting a few key phrases or infobox details (copied from source webpages) within each stub article.
If each year's count were then to average 83% of the prior year's, the projected daily average in year 2035 could be about 10 new articles per day.
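The table arithmetic can be checked with a short script. This is a sketch under stated assumptions: it applies one constant 83% retention factor from the 2008 value of 1,437, which may not be how the listed figures were derived, so its results need not match the table exactly. The function names are hypothetical.

```python
BASE_YEAR = 2008
BASE_DAILY = 1437  # daily new-article count listed for 2008


def projected_daily(year, retention=0.83):
    """Daily new-article count in `year`, assuming each year keeps a constant
    `retention` share of the prior year's count (0.83 = 17% decline)."""
    return BASE_DAILY * retention ** (year - BASE_YEAR)


def implied_retention(first, last, years):
    """Constant yearly retention factor that takes `first` to `last`
    over `years` years."""
    return (last / first) ** (1.0 / years)


# Values as listed in the essay's table.
listed = {2008: 1437, 2015: 428, 2022: 116, 2029: 31, 2036: 9}

for year, value in listed.items():
    print(year, value, round(projected_daily(year)))

# Retention factor implied by the first two listed values.
print(round(implied_retention(listed[2008], listed[2015], 7), 3))

# Projected daily average for 2035 under the 83% assumption.
print(round(projected_daily(2035), 1))
```

Comparing the listed and computed columns shows how closely a single constant rate tracks the table, and `implied_retention` recovers the yearly factor that any two listed values actually imply.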
Projections could change radically
Beware that these projections assume a continuation of the prior types of new articles. Any drastic change in mass uploads or new-article restrictions could radically alter the rate of new-article creation. For example:
- If some Wikiproject decided to auto-upload new articles, generated as stubs, from a huge database of "Who's who in science", then a massive upsurge would occur for new articles.
- In contrast, if Wikipedia policies were quickly changed to demand sources, such as requiring 2 independent sources per new stub, then new-article creation could fall to just a few dozen a day.
Because of the widespread impact of mass uploads or new-article restrictions, the actual growth figures could veer widely from the projected levels, within only weeks of the current time.
See also
- Wikipedia:Modelling Wikipedia's growth – essay comparing with older models
- Wikipedia:Pruning article revisions
- Wikipedia:Overlink crisis
- [ This essay is a rapid draft, created with very limited time. ]