Talk:Zipf's law

This article is of interest to the following WikiProjects:

WikiProject Statistics (Rated B-class, Mid-importance): This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

WikiProject Mathematics (Rated B-class, Mid-importance; Field: Probability and statistics): This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

WikiProject Linguistics / Applied Linguistics (Rated C-class, Mid-importance): This article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of Linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks. This article is supported by the Applied Linguistics Task Force.

Untitled

Is it true that the word "the" does indeed occur about twice as often as the next most common English word? The rest of the article seems to allow for some proportionality constants. AxelBoldt

It's not true, so I replaced it with a statement about Shakespeare's plays. AxelBoldt

random typing

The main article claimed that

  • the frequency distribution of words generated by random typing follows Zipf's law.

I doubt that very much. For one thing, if you type randomly, all words of length one will be equally likely, all words of length 2 will be equally likely and so on. Or am I missing something? Maybe we should perform a little perl experiment. AxelBoldt

I tried this Python code, and plotted the results in a log-log plot -- the early ranks are a bit stepped, but the overall pattern fits Zipf's law rather well. The Anome

import random

N = 10000   # characters per pass
M = 100     # number of passes

words = {}
for j in range(M):
  # skewed character frequencies -- the space is by far the most common
  chars = [random.choice('aaabccdefg     ') for i in range(N)]
  for word in ''.join(chars).split():
    words[word] = words.get(word, 0) + 1
  print('did string pass', j)

vals = sorted(words.values(), reverse=True)

# Let's just have the first few ranks
with open('zipf_ranks.txt', 'w') as f:
    useranks = min(1000, len(vals))
    for i in range(useranks):
        rank = i + 1
        f.write("%d %d\n" % (rank, vals[rank - 1]))

Could you repeat the experiment with all letters and the space getting the same probability? That's at least what I thought of when I heard "random typing". AxelBoldt

When I was told of the "random typing" experiment, the "typing" part was more important than "random". If you sit at a keyboard and randomly type, you have a much higher chance of hitting certain keys than others because your fingers tend to like certain positions. Also, the human brain is a pattern-matching machine, so it works in patterns. If you look at what you type, you might notice you tend to type the same sequences of letters over and over and over.
For a brief historical note, this theory was used in cracking the one-time-pad cyphers. Humans typed the pads, so it was possible to guess at the probability of future cyphers by knowing past ones. It is like playing the lottery knowing that a 6 has an 80% chance of being the first number while a 2 has a 5% chance. If you knew the percentage chance for each number and each position, you would have a much greater chance of winning over time. Kainaw 13:59, 24 Sep 2004 (UTC)

I can't right now, but I'll give you the reason for the skewed probabilities -- the space is by far the most common character in English, and other chars have different probabilities -- I wanted to model that. The Anome


The main article claimed that

  • the frequency distribution of words generated by random typing follows Zipf's law.

I doubt that very much. For one thing, if you type randomly, all words of length one will be equally likely, all words of length 2 will be equally likely and so on. Or am I missing something?

Yes, you're missing something.

It doesn't *exactly* match Zipf's law. But then, no real measurement exactly matches Zipf's law -- there's always "measurement noise". It does come pretty close. In most English text, about 1/5 of all the characters are space characters. If we randomly type the space and 4 other letters (with equal letter frequencies), then we expect words to have one of these discrete probabilities:

1/4*1/5: each of the 4 single-letter words
1/4*1/4*1/5: each of the 4*4 two-letter words
1/4*1/4*1/4*1/5: each of the 4*4*4 three-letter words.
...
(1/5)*(1/4)^n: each of the 4^n n-letter words.

This is a stair-step graph, as you pointed out. However, if we plot it on a log-log graph, we get

 x = log( 4^n / 2 ) = n * log(4) - log(2).
 y = log( 1/5 * (1/4)^n ) = -n*log(4) - log(5).

which is pretty close to a straight line (and therefore a Zipf distribution), with slope

 m = Δy / Δx = -1.

You get the same stair-step on top of a straight line no matter how many letters (plus space) you use, with equal letter frequencies. If the letter frequencies are unequal (but still memoryless), I think that rounds off the corners and makes things even closer to a Zipf distribution. --DavidCary 01:54, 12 Feb 2005 (UTC)

The problem here seems to be arising from the ambiguity of the phrase "random typing". To avoid this I'll replace the phrase with "random sampling of a text". Jekyll

It's already gone, never mind. Jekyll 14:45, 18 November 2005 (UTC)


This is a pretty minor example of Zipf's law. A much more important one is that the frequencies of occurrence of words in natural language conform to Zipf's law. This is in the original 1949 reference. Callivert (talk) 11:29, 24 March 2008 (UTC)

Why?

Moved from main article:

We need an explanation here: why do these distributions follow Zipf's law?
No we don't. Zipf's law is empirical, not theoretical. We don't know why it works. But even without a theory, the simplest experiments that try to model a society of independent actors consistently turn it up!--BJT

Well, empirical facts have to be explained too. It's not enough to simply state that the moon always shows us the same side; you have to give the reason if you try to understand the world. It's the same here. If Zipfian distributions show up in a variety of situations, then there must be some underlying principle which generates them. I doubt very much that "we don't know" that principle. AxelBoldt

I agree - every theory begins with empirical evidence. The theory models an explanation to fit those facts. I'm sure someone has tried to come up with an explanation? 70.93.249.46

We need an explanation here: why do these distributions follow Zipf's law?

An excellent question. However, I doubt there is a single cause that can explain every occurrence of Zipf's law. (For some distributions, such as wealth distribution, the cause of the distribution is controversial).

Well, empirical facts have to be explained too. It's not enough to simply state that the moon always shows us the same side; you have to give the reason if you try to understand the world.

Good point. However, sometimes we don't yet know the cause of some empirical facts -- we can't yet give a good explanation. In those cases, I would prefer the Wikipedia article to bluntly tell me "we don't know yet" rather than try to dance around that fact.



While it is true that Zipf's law is empirical, I agree with AxelBoldt that it is useful to have an interpretation of it. The most obvious place to look is the book that Zipf himself wrote in which he linked his observation to the Principle of Least-Effort, kind of an application of Conservation of Energy to human behavior.


I've just measured the Polish Wikipedia page access distribution using Apache logs (so they had to be heavily Perlscripted) for about 2 weeks of late July, only for main namespace articles, and excluding the Main Page. In most part it seems to be following Zipf's law with b about 0.5, except at both ends, where it behaves a bit weirdly (which was to be expected). Now why did I get a constant so grossly different from the stated constant for the English Wikipedia?

Some possibilities:

  • The Polish and English Wikipedias really have different Zipf factors
  • It was due to my perlscripting
  • The measurement given for the English Wikipedia here is wrong for some reason, e.g. measuring only the top 100 articles, and not all of them.

Taw 01:38, 4 Aug 2003 (UTC)




I think we may be fast approaching the point when merging this article with Zipf-Mandelbrot law would be appropriate, along the way doing some reorganizing of the article. Michael Hardy 22:57, 6 Dec 2003 (UTC)

Although the reason is not well-understood, mechanisms that bring about the Zipf distribution have been suggested by physicists. Power laws tend to crop up in systems where the entities (words in this case) are not independent, but interact locally. The choice of a word isn't random, nor does it follow a mechanistic prescription - the choice of a word depends strongly on what other words have already been chosen in the same sentence/paragraph. I think these speculations should be mentioned in the article as a side note, for the sake of completeness. 137.222.40.132 12:45, 17 October 2005 (UTC)



It is meaningful to ask why in general a particular distribution is found in nature - the passage above is a good start; I'd like more clarification - for example, the normal distribution arises when the outcome is caused by a large number of minor factors, no particular one of which predominates. The bimodal distribution arises when there are a large number of minor factors coupled with one predominant factor. The Poisson distribution arises when an event is the consequence of a large number of rare events converging. Etc. For Zipf's distribution, I would like to know: why does interdependence of events lead to it?



Another interesting example of Zipf's law is the distribution of 25,000,000 organic compounds among the possible shapes that the ring structures can take (download PDF or HTML). Half of all the compounds have just 143 shapes (out of nearly 850,000 shapes). The authors refer to a "rich-get-richer" reason: when we are speaking or writing, the probability that we will use a particular word is proportional to the number of times that we have already used it. The particular reason in organic chemistry is that it is quicker and cheaper to synthesize new compounds if you can buy the "parts" commercially, or if syntheses for the "parts" have already been published; thus the most common shapes tend to be used more.

In these word-list examples we are also dealing with a fundamental fact of linguistics: every language has a few dozen "structural" words, mostly articles, conjunctions, and prepositions, which have to be used to achieve proper syntax (note the three most popular words in English). This can skew the distribution of the most popular words.

This would also apply to populations of cities. A large city has many opportunities for work and leisure, so it will attract new inhabitants and keep its current ones. On the whole, one would expect that the probability that new inhabitants will come would be, very roughly, proportional to the existing population.

So the reason seems to be that prior use encourages further use. Almost banal.

--Solo Owl (talk) 17:26, 7 September 2008 (UTC)
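The "prior use encourages further use" mechanism described above is essentially Herbert Simon's 1955 preferential-attachment model, and it is easy to simulate. A minimal sketch (step count, innovation rate alpha, and seed are all illustrative): with probability alpha a brand-new word is coined; otherwise an existing word is repeated, with probability proportional to how often it has already been used.

```python
import math
import random
from collections import Counter

def simon_model(steps=100_000, alpha=0.05, seed=2):
    """Simon/Yule process: new word with probability alpha, else repeat a
    word chosen proportionally to its current frequency (implemented by
    drawing uniformly from the list of all tokens emitted so far)."""
    rng = random.Random(seed)
    tokens = [0]          # word ids; seed the "text" with one word
    next_id = 1
    for _ in range(steps):
        if rng.random() < alpha:
            tokens.append(next_id)             # coin a new word
            next_id += 1
        else:
            tokens.append(rng.choice(tokens))  # rich get richer
    return Counter(tokens)

def top_rank_slope(counts, top=100):
    """Least-squares slope of log(frequency) vs log(rank), top ranks only."""
    freqs = sorted(counts.values(), reverse=True)[:top]
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

word_counts = simon_model()
slope = top_rank_slope(word_counts)
```

For small alpha the fitted rank-frequency slope comes out near -(1 - alpha), i.e. close to the exponent of 1 that Zipf observed for natural language.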

Consistency of variables in text

In the examples section, the variable quoted in each case is b. However, this variable is not used anywhere else. Some consistency throughout the article would be nice (and less confusing!). — 130.209.6.41 17:01, 1 Jun 2004 (UTC)

I agree - please explain what is b ? \Mikez 10:00, 8 Jun 2004 (UTC)
I was about to add something about this as well... is it what is called s in the discussion of formulas? It's not clear from the text. -- pne 14:01, 8 Jun 2004 (UTC)
Well, I've seen Zipf's Law stated as f_n=[\mbox{constant}]/n^b. So I'm pretty sure that s and b are the same thing. I'm changing b to s in the "Examples..." section. -- Aparajit 06:01, Jun 24, 2004 (UTC)
Can someone check the values of b / s given in the examples? Especially the word frequency example. I took the data for word frequencies in Hamlet, and fitted a line to the log-log plot. This gave a slope of more like 1.1, rather than the 0.5 figure quoted here. Taking the merged frequencies over the complete set of plays gives a value which gets towards 1.3. This would agree more with the origin of Zipf's law, which is that the frequency of the i'th word in a written text is proportional to 1/i. The value of 0.5 seems much too small to match this observation. Graham.
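The kind of fit Graham describes can be sanity-checked on synthetic data. A sketch (vocabulary size, sample size, seed, and s are all illustrative): draw tokens from an exact Zipf distribution with s = 1.1 and recover the exponent by a least-squares fit over the top ranks, where sampling noise is smallest.

```python
import math
import random
from collections import Counter

def zipf_sample_counts(s=1.1, vocab=5000, n_words=200_000, seed=3):
    """Draw n_words tokens from a Zipf(s) distribution over vocab ranks."""
    rng = random.Random(seed)
    ranks = range(1, vocab + 1)
    weights = [r ** -s for r in ranks]
    return Counter(rng.choices(ranks, weights=weights, k=n_words))

def fitted_exponent(counts, top=500):
    """Least-squares estimate of s from the top-ranked frequencies."""
    freqs = sorted(counts.values(), reverse=True)[:top]
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return -num / den  # the log-log slope is -s

s_hat = fitted_exponent(zipf_sample_counts())
```

With these settings the recovered exponent sits near the true 1.1; restricting the fit to the top ranks matters, because the rare tail (words sampled once or twice) otherwise biases the slope.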

Linked site doesn't exist

It seems to me that the special [popular pages] page no longer exists, but it is used as an example on this page.

Has this page simply moved or do we need to get a new example?

reported constants in Examples section

The examples section reports values of s < 1 as resulting from analysis of Wikipedia page view data. The earlier discussion correctly notes that such values do not yield a valid probability distribution. What gives? Perhaps (s - 1) is being reported?

Or it could be that that value of s is right for a moderately large (hundreds?) finite number of pages. That seems to happen with some usenet posting statistics. Michael Hardy 22:07, 28 Jun 2004 (UTC)

I miss a reference to Zipf's (other) law: the principle of least effort.

s = 1?

I plotted Shakespeare's word frequency lists and top 5000 words in Wikipedia: [1] Where did the old value of s~0.5 come from? -- Nichtich 00:27, 23 Jun 2005 (UTC)

Sources needed for examples

Section 3 ("Examples of collections approximately obeying Zipf's law") has a bunch of examples with no further explanation or reference (Shakespeare excluded). That's highly undesirable, so let's get some sources. By the way, I find the final point (notes in a musical performance) very questionable. I imagine it would depend very much on the type of music. EldKatt (Talk) 13:20, 19 July 2005 (UTC)

It might not be a bad idea to give examples from Zipf's book. Also see the article by Richard Perline in the February 2005 issue of Statistical Science. Michael Hardy 22:28, 19 July 2005 (UTC)

Too technical?

Rd232 added the "technical" template to the article. I've moved it here per template. Paul August 03:49, 27 November 2005 (UTC)

There have been numerous edits to the article since the template was first added almost a year ago. Also, there has been no discussion of what about the article is too technical so I've removed the template. Feel free to put it back, but if you do, please leave some comments as to what you find is too technical and some suggestions as to how to improve the article. Lunch 02:25, 21 November 2006 (UTC)

Does Wikipedia traffic obey Zipf's law?

Yes, apparently, with an exponent of 0.5. See Wikipedia:Does Wikipedia traffic obey Zipf's law? for more. -- The Anome 22:45, 20 September 2006 (UTC)

Wikipedia's Zipf law

Just a plot of English Wikipedia word frequencies: http://oc-co.org/?p=79

Is this plot available to Wikipedia - i.e. free content? It would look good in the article.
Yes, it is released under LGPL by me, the author :) -- Victor Grishchenko
I downloaded it, tagged it as LGPL with you as author, and put it into the article - please check it out and make corrections if needed. This is an excellent demonstration of Zipf's law (and its limitations). Thanks! PAR 15:00, 29 November 2006 (UTC)

This article has 107 "the" and 68 "of" last time I checked.


—Preceding unsigned comment added by 200.198.220.131 (talk) 13:35, 5 October 2009 (UTC)

Now this is interesting. It is, however, not a true power law, as Grishchenko shows quite clearly by superposing blue and green lines. That is, English Wikipedia is not an example of Zipf’s law. The tail (the lower righthand part of the curve) is not typical of power laws.

The first part of the plot, for the 8000 or so most common words, does follow a power law, with exponent slightly greater than 1, just as we would expect from Zipf’s Law. The rest of the plot, for another million different words in English Wikipedia, follows a power law with exponent approximately 2; this part has the tail that we look for in power laws.

My guess is that in writing encyclopedia articles, one must select from the 8000 most common words just to create readable prose. Hence the log-log plot with slope ~1.

An encyclopedia, by its nature, is also going to contain hundreds of thousands of technical terms and proper names, which will be used infrequently in the encyclopedia as a whole. Hence the log-log plot with a steeper slope. Why should this slope be ~2? Does anyone know?

Does Zipf’s law apply to individual Wikipedia articles? Do other encyclopedias behave the same?

--Solo Owl (talk) 17:59, 7 September 2008 (UTC)

It's a general principle that words drop off more quickly than Zipf's law predicts. If we looked at word n-grams I think we would see something closer. Dcoetzee 21:27, 31 October 2008 (UTC)

+ 1 paper related to Wikipedia's Zipf law:

Index wiki database: design and experiments. Krizhanovsky A. A. In: FLINS'08, Corpus Linguistics'08, AIS/CAD'08, 2008. -- AKA MBG (talk) 17:52, 3 November 2008 (UTC)

On "Zipf, Power-laws, and Pareto - a ranking tutorial"

I've found two doubtful places in the tutorial by L. Adamic (ext. link N3). Probably, I've misread something...

First, "(a = 1.17)" regarding Fig. 1b must be a typo; the slope is clearly -2 or so.

Second, it is not clear whether Fig. 2a is a cumulative or a disjoint histogram. To the best of my knowledge, a Zipf distribution binned disjointly into logarithmic bins this way must have slope = -1, not -2; i.e., if every bin catches the items whose popularity resides in the range [c^{i}:c^{i+1}). Just to verify this, I did a log2-log2 graph of log2-binned word frequencies compiled from Wikipedia, i.e. y is the log2 of the number of words mentioned 2^x to 2^{x+1}-1 times in the whole of Wikipedia. Although the curve is not that simple, it shows slope = -1, especially for the more frequent words.

Any thoughts? Gritzko

Yes, it looks like a typo; a=1.17 certainly is not right. Regarding Fig 2a, it is cumulative (the vertical axis is "proportion of sites", which will have an intercept at (1,1)). Zipf's law does not specify an exponent of -1, just that it is some negative constant. It happens to be close to -1 for word frequency, but maybe it's closer to -2 for the AOL user data. PAR 14:20, 1 January 2007 (UTC)
"As demonstrated with the AOL data, in the case b = 1, the power-law exponent a = 2.", i.e. b is "close to unity" in the case of AOL user data. I had some doubts whether Fig 2a is cumulative because at x=1 y seems to be slightly less than 1. Probably, it is just a rendering glitch. Thanks! -- Gritzko
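Gritzko's claim that disjointly log2-binned Zipf (s = 1) data should show slope -1 can be verified exactly on idealized counts. A sketch (the corpus constant C is illustrative): give the rank-r word C // r occurrences and bin words by the integer part of log2 of their count.

```python
import math
from collections import Counter

C = 2 ** 20                  # occurrences of the top-ranked word (illustrative)
bins = Counter()
for r in range(1, C + 1):    # ideal Zipf with s = 1: rank-r word occurs C // r times
    bins[int(math.log2(C // r))] += 1

# bins[i] = number of distinct words whose count lies in [2^i, 2^(i+1));
# each bin holds exactly twice as many words as the next one up,
# i.e. slope -1 on a log2-log2 plot, matching the Wikipedia measurement above.
```

This is the disjoint-histogram counterpart of the cumulative plot discussed above: cumulating the bins changes the apparent exponent, which is exactly the ambiguity PAR and Gritzko were untangling.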

k = 0 in support?

I'm not sure whether k=0 should be included in the support... the pmf is not well defined there, since 1/k^s = 1/0 diverges to +inf. Krzkrz 08:56, 3 May 2007 (UTC)

biographical information

I was surprised that there wasn't even a brief note at the beginning of the article on who Zipf is (was?). I think it's sort of nice to see that before you get into the technical stuff. Jdrice8 05:38, 14 October 2007 (UTC)

That was the brief second paragraph of the article. Now I've made it into a parenthesis in the first sentence, set off by commas. Michael Hardy 01:58, 15 October 2007 (UTC)

Word length

I'm sure I've come across, in a number of places, the term Zipf's law used to refer to the inverse relation between word length and frequency. This may well be a misuse, but isn't it common enough to be mentioned, with a link to whatever the correct term is? Peter jackson (talk) 14:43, 15 September 2008 (UTC)

Same here. I wish the page explained that up front, but I don't know enough to write it. — Preceding unsigned comment added by 76.94.255.117 (talk) 20:48, 14 July 2012 (UTC)

This page supposedly has a quote from Zipf: http://link.springer.com/content/pdf/10.1007%2F978-1-4020-4068-9_13.pdf

“that the magnitude of words tends, on the whole, to stand in an inverse (not necessarily proportionate) relationship to the number of occurrences” — Preceding unsigned comment added by 216.239.45.72 (talk) 19:43, 16 May 2013 (UTC)

Lead

The lead must be rephrased: defining Zipf's law in terms of a Zipfian distribution is no help to the reader who must actually learn. Srnec (talk) 15:11, 31 October 2008 (UTC) —Preceding unsigned comment added by 81.185.244.190 (talk) 21:28, 8 January 2009 (UTC)

I was just coming by to say the same thing. The law itself should be defined in the lede; defining it in terms of something derived from it does no one any good. It's been three and a half years and no one's done anything about it? 174.57.203.45 (talk) 19:12, 1 June 2012 (UTC)

Nonsense? Err Nevermind

The article currently states the following:

That Zipfian distributions arise in randomly-generated texts with no linguistic structure suggests that in linguistic contexts, the law may be a statistical artifact.[2]

I don't get it; this sounds like patent nonsense. When one generates a random text, one must choose words with some random frequency or distribution. What distribution is used? The result of random generation should show the same distribution as the random number generator: so if one generates random text using a Zipfian distribution, the result will be Zipfian. So this is a tautology. Surely something different is being implied in citation 2. However, this text, as written, is nonsense. linas (talk) 22:36, 14 March 2009 (UTC)

Ah. The one-sentence summary was misleading. I modified the article text so that it makes sense. The "random text" is actually a random string, chosen from an alphabet of N letters, with one of the letters being a blank space. The blank space acts as a word separator, and the letter frequencies are uniformly distributed. linas (talk) 22:59, 14 March 2009 (UTC)
As the text is now, it's still (or again?) nonsense. It claims that the Zipfian distribution is a consequence of using letters to spell words. Just a few sentences above, the article says that words in the corpus are in Zipfian distribution. This has nothing to do with how words are encoded, and the distribution would exist regardless of what, if anything, is used to represent them. 193.77.151.137 (talk) 16:07, 5 April 2009 (UTC)

Frequency of words in English

The article originally stated,

In English, the frequencies of the approximately 1000 most-frequently-used words are approximately proportional to 1/{n^s} where s is just slightly more than one.[citation needed]

This is misleading, not just because the citation is missing, but also because it seems to me that English word frequencies may well follow the maximum entropy principle: their distribution maximises the entropy under the constraint that the probabilities decrease. Zipf's law, for any s>1, does not maximise the entropy, because it is possible to define distributions with probabilities that decrease even more slowly. For example, P(n) = 1/\log(n+1) - 1/\log(n+2) is a very slow one.

I therefore rewrote this bit somewhat. —Preceding unsigned comment added by 131.111.20.201 (talk) 09:52, 1 May 2009 (UTC)
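For what it's worth, the slowly decreasing example above can be checked numerically. A sketch, assuming base-2 logarithms so that the telescoping sum normalizes to exactly 1 (with natural logarithms the same construction would need a normalizing factor of log 2):

```python
import math

def p(n):
    """P(n) = 1/log2(n+1) - 1/log2(n+2): decreasing, yet with a tail
    heavier than 1/n^s for every s > 1."""
    return 1 / math.log2(n + 1) - 1 / math.log2(n + 2)

# The sum telescopes: sum over n = 1..N equals 1 - 1/log2(N + 2),
# so it converges to 1 -- but extremely slowly.
partial = sum(p(n) for n in range(1, 10**6 + 1))
```

Even after a million terms the partial sum is only about 0.95, which illustrates just how much more slowly this distribution decays than any Zipf law with s > 1.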

Problems

I think there is an error in the example log-log graph of Wikipedia word frequency vs. rank. The two different lines of color cyan and blue cannot both represent 1/x^2, as they are labelled. I suspect that one is 1/x^3. Whoever has access to the code that generated the graph, please fix it!!!


—Preceding unsigned comment added by 98.222.58.129 (talk) 04:45, 8 January 2010 (UTC)

I agree the cyan plot is wrong but I don't think it can be 1/x^3 since it has the same slope as 1/x^2. It is more like 10110/x^2. I am going to ping the creator on this - it definitely needs fixing or removing. SpinningSpark 17:01, 4 June 2011 (UTC)
Actually, all the guide plots are up the creek: they should all be going through the point (1,1). I suspect the author means \mathcal{O} f(x) rather than  \sim f(x) SpinningSpark 17:22, 4 June 2011 (UTC)

The section entitled "Statistical Explanation" is clearly nonsense. Equal probability for selecting each letter means that the top of the rank order will be 26 equally probable single letter words. "b" will not occur twice as often as "a". What is Wentian Li talking about??? Clearly it is not what is said here.


—Preceding unsigned comment added by 98.222.58.129 (talk) 04:56, 8 January 2010 (UTC)

citation needed

I don't know how to add a citation-needed link. It's needed for the parenthetical "(although the law holds much more strongly to natural language than to random texts)".

sbump (talk) 21:42, 19 March 2010 (UTC)

type {{citation needed}}, it will look like this: [citation needed]. Cheers, — sligocki (talk) 03:21, 21 March 2010 (UTC)

Redundancy

Although the Shannon-Weaver entropy is listed in the table here, I think it is of fundamentally greater importance and instructional value to list the REDUNDANCY, which in Zipf's law (s=1) is CONSTANT (and slightly less than 3). Credit to Heinz von Foerster for deriving this from fundamental principles.

98.222.58.129 (talk) 14:04, 19 February 2010 (UTC)