Birthday problem

The birthday paradox states that if there are 23 people in a room then there is a chance of more than 50% that at least two of them will have the same birthday. This means that in a typically-sized school class, where the 'paradox' is often cited, an even higher probability often applies. For 60 or more people, the probability is already greater than 99%. This is not a paradox in the sense of leading to a logical contradiction; it is a paradox in the sense that it is a mathematical truth that contradicts common intuition. Most people estimate that the chance is much lower than 50:50. Calculating this probability (and related ones) is the birthday problem. The mathematics behind it has been used to devise a well-known cryptographic attack named the birthday attack.

Understanding the paradox

The key to understanding the birthday paradox is to realize that there are many possible pairs of people whose birthdays could match. Specifically, among 23 people, there are C(23,2) = 23 × 22/2 = 253 pairs, each of which is a potential candidate for a match. Looked at in this way, it doesn't seem that unlikely that one of these 253 pairs yields a match.
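The pair count can be checked in a few lines. The sketch below also includes a rough heuristic that is not part of the argument above: it treats the 253 pairs as if they were independent, which they are not, so the result is only indicative. The function names are illustrative.

```python
from math import comb

def pair_count(n: int) -> int:
    """Number of unordered pairs that can be formed among n people."""
    return comb(n, 2)

def independent_pairs_estimate(n: int, d: int = 365) -> float:
    """Heuristic chance of at least one match, treating pairs as independent."""
    # Each pair fails to match with probability 1 - 1/d.
    return 1.0 - (1.0 - 1.0 / d) ** pair_count(n)

print(pair_count(23))                  # 253
print(independent_pairs_estimate(23))  # roughly 0.50
```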

To emphasize the point, consider a different scenario: if you enter a room with 22 other people, the chance that somebody there has the same birthday as you is not 50:50, but much lower. This is because now there are only 22 possible pairs that could yield a match. The actual birthday problem is asking if any of the 23 people have a matching birthday with any of the others.

Calculating the probability

To compute the approximate probability that in a room of n people, at least two have the same birthday, we disregard distribution variations, such as leap years, twins, seasonal or weekday variations, and assume that d = 365 possible birthdays are equally likely. Real-life birthdays are not uniformly distributed, so this is a simplification.

Note that keeping d generic allows us to find the (approximate) solution to a more general problem: given n random integers drawn from a discrete uniform distribution with range [1,d], what is the probability that at least two numbers are the same?

The trick is to first calculate the probability p̄(n;d) that all n birthdays are different. If n > d, by the pigeonhole principle this probability is 0%. On the other hand, if n ≤ d, it is given by

{\bar {p}}(n;d)=1\cdot \left(1-{\frac {1}{d}}\right)\cdot \left(1-{\frac {2}{d}}\right)\cdots \left(1-{\frac {n-1}{d}}\right)=\prod _{k=1}^{n-1}\left(1-{k \over d}\right),

because the second person cannot have the same birthday as the first (364/365), the third cannot have the same birthday as the first two (363/365), etc.
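The product can be evaluated directly. The following is a minimal sketch; the function names `p_all_different` and `p_shared` are illustrative and do not come from the text.

```python
def p_all_different(n: int, d: int = 365) -> float:
    """Probability that n people all have different birthdays among d days."""
    if n > d:
        return 0.0  # pigeonhole principle: a repeat is certain
    prob = 1.0
    for k in range(1, n):
        # Person k+1 must avoid the k birthdays already taken.
        prob *= 1.0 - k / d
    return prob

def p_shared(n: int, d: int = 365) -> float:
    """Probability that at least two of n people share a birthday."""
    return 1.0 - p_all_different(n, d)

print(round(p_shared(23), 4))  # 0.5073
```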

Using the Taylor series expansion (some may say the definition) of the exponential function

e^{x}=1+x+{\frac {x^{2}}{2!}}+\cdots

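each k/d is small compared with 1, so truncating the series after the linear term gives 1 − k/d ≈ e^(−k/d), and the above expression can be approximated as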

{\bar {p}}(n;d)\approx 1\cdot e^{-1/d}\cdot e^{-2/d}\cdots e^{-(n-1)/d}

=1\cdot e^{-(1+2+\cdots +(n-1))/d}

=e^{-(n(n-1))/2d}

The event of at least two of the n persons having the same birthday is complementary to all n birthdays being different. Therefore, its probability p(n;d) is

p(n;d)=1-{\bar {p}}(n;d)\approx 1-e^{-(n(n-1))/2d}

Evaluating the exact product for n = 23 and d = 365 gives a probability of about 50.7% (the exponential approximation gives about 50.0%). The following table shows other probabilities for d = 365, computed from the exact product:

n	p(n;365)
10	11.7%
20	41.1%
30	70.6%
50	97.0%
100	99.99997%
200	99.9999999999999999999999999998%
300	1 − (6 × 10⁻⁸²)
350	1 − (3 × 10⁻¹³¹)
≥366	100%
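For large n the probability is so close to 1 that subtracting from 1 in double-precision arithmetic simply returns 1.0. One way around this, sketched here as an illustration, is to work with the logarithm of the complementary probability (all birthdays different). The function name is hypothetical.

```python
import math

def log10_all_different(n: int, d: int = 365) -> float:
    """log10 of the probability that all n birthdays differ (needs n <= d)."""
    if n > d:
        raise ValueError("the probability is exactly 0 when n > d")
    # log1p(-k/d) evaluates log(1 - k/d) accurately, and summing logs
    # avoids multiplying many small factors together.
    total = sum(math.log1p(-k / d) for k in range(1, n))
    return total / math.log(10)

for n in (100, 200, 300, 350):
    print(n, round(log10_all_different(n), 2))
```

A result of, say, −30 means that the probability of a match is 1 − 10⁻³⁰.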

Accuracy of the approximation

An even coarser approximation to the answer is given by

p(n;d)\approx 1-e^{-n^{2}/2d},\,

which is still fairly accurate for d = 365.
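To get a feel for how close the two approximations are, they can be compared numerically with the exact product. This is a sketch with illustrative function names (`exact`, `approx`, `coarse`).

```python
import math

def exact(n: int, d: int = 365) -> float:
    """Exact probability of a shared birthday, from the product formula."""
    if n > d:
        return 1.0
    all_different = 1.0
    for k in range(1, n):
        all_different *= 1.0 - k / d
    return 1.0 - all_different

def approx(n: int, d: int = 365) -> float:
    """Exponential approximation 1 - exp(-n(n-1)/(2d))."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * d))

def coarse(n: int, d: int = 365) -> float:
    """Coarser approximation 1 - exp(-n^2/(2d))."""
    return 1.0 - math.exp(-n * n / (2 * d))

for n in (10, 23, 50):
    print(n, round(exact(n), 4), round(approx(n), 4), round(coarse(n), 4))
```

For n = 23 the three values come out near 0.507, 0.500 and 0.516 respectively, so the first approximation slightly underestimates and the coarser one slightly overestimates the exact probability there.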

Same birthday as you

Figure: comparison of p(n;365), the probability of a birthday match, with q(n;365), the probability of matching your birthday.

Note that in the birthday problem, neither of the two people is chosen in advance. By way of contrast, the probability q(n;d) that someone in a room of n other people has the same birthday as a particular person (for example, you), or more generally has picked the same number between 1 and d as you, is given by

q(n;d)=1-\left({\frac {d-1}{d}}\right)^{n}

Substituting n = 22 and d = 365 gives about 5.9%, which is only slightly better than 1 chance in 17. For a greater than 50:50 chance that one person in a roomful of n people has the same birthday as you, n would need to be at least 253. Note that this number is significantly higher than 365/2 = 182.5: the reason is that there are likely some birthday matches among the people in the room.
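The formula for q(n;d) and the threshold of 253 can be checked with a short sketch. The function names are illustrative.

```python
def q_match_you(n: int, d: int = 365) -> float:
    """Probability that at least one of n other people shares your birthday."""
    return 1.0 - ((d - 1) / d) ** n

def smallest_n_for(target: float = 0.5, d: int = 365) -> int:
    """Smallest n for which q(n; d) exceeds the target probability."""
    n = 1
    while q_match_you(n, d) <= target:
        n += 1
    return n

print(round(q_match_you(22), 3))  # 0.059
print(smallest_n_for(0.5))        # 253
```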

Reverse problem

An alternate question may be:

For a fixed probability p and number of days in a year d...

... find the greatest n(p;d) for which the probability p(n;d) is smaller than the given p, or

... find the smallest n(p;d) for which the probability p(n;d) is greater than the given p.

An approximation for this is given by:

n(p;d)\approx \left(2d\ln \left({1 \over 1-p}\right)\right)^{1/2}.

Example

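Here n↓ and n↑ are the approximation for d = 365 rounded down and up, and p(n↓) and p(n↑) are the exact probabilities computed for those values.

p	n (general d)	n (d = 365)	n↓	p(n↓)	n↑	p(n↑)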
0.01	0.14178 √d	2.70864	2	0.00274	3	0.00820
0.05	0.32029 √d	6.11916	6	0.04046	7	0.05624
0.1	0.45904 √d	8.77002	8	0.07434	9	0.09462
0.2	0.66805 √d	12.76302	12	0.16702	13	0.19441
0.3	0.84460 √d	16.13607	16	0.28360	17	0.31501
0.5	1.17741 √d	22.49439	22	0.47570	23	0.50730
0.7	1.55176 √d	29.64625	29	0.68097	30	0.70632
0.8	1.79412 √d	34.27666	34	0.79532	35	0.81438
0.9	2.14597 √d	40.99862	40	0.89123	41	0.90315
0.95	2.44775 √d	46.76414	46	0.94825	47	0.95477
0.99	3.03485 √d	57.98081	57	0.99012	58	0.99166

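Note: the approximation is not always exact. In the rows for p = 0.01, 0.1 and 0.2, p(n↑) is still below p, and in the row for p = 0.99, p(n↓) already exceeds p (these values were coloured in the original table).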
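A row of the table can be reproduced with the sketch below. It assumes that n↓ and n↑ are the floor and ceiling of the approximation, which is consistent with the tabulated values. The function names are illustrative.

```python
import math

def p_shared(n: int, d: int = 365) -> float:
    """Exact probability that at least two of n people share a birthday."""
    if n > d:
        return 1.0
    all_different = 1.0
    for k in range(1, n):
        all_different *= 1.0 - k / d
    return 1.0 - all_different

def n_approx(p: float, d: int = 365) -> float:
    """Approximate n(p; d) = sqrt(2 d ln(1 / (1 - p)))."""
    return math.sqrt(2 * d * math.log(1 / (1 - p)))

def check_row(p: float, d: int = 365):
    """Return (n, floor, ceiling, brackets); brackets is True when
    p(floor) < p < p(ceiling), i.e. the approximation is exact here."""
    n = n_approx(p, d)
    lo, hi = math.floor(n), math.ceil(n)
    brackets = p_shared(lo, d) < p < p_shared(hi, d)
    return n, lo, hi, brackets

for p in (0.01, 0.1, 0.5, 0.99):
    print(p, check_row(p))
```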

Implications of inequalities

For variations of the birthday scenario in broader contexts, a different flavor of argument is essential. The argument below, exploiting important inequalities, is adapted from an argument of Paul Halmos (see Note).

Let p̄(n) denote the probability that no two birthdays coincide, so that the probability of coincident birthdays is p(n) = 1 − p̄(n). The usual argument given above says that p̄(n) is the product

\prod _{k=1}^{n-1}\left(1-{k \over 365}\right).

We are interested in the smallest n such that p(n) > 1/2; or equivalently, the smallest n such that p̄(n), shown here, is less than 1/2. The general idea is to repeatedly replace the complicated product by simpler expressions, each of which is no smaller in value. If our final simple expression is less than 1/2, then the true value must be also. Results obtained this way may be overly conservative, in the sense that a smaller value of n might suffice; but they are always safe.

Because of the inequality of arithmetic and geometric means, we have

{\sqrt[{n-1}]{\prod _{k=1}^{n-1}\left(1-{k \over 365}\right)}}<{1 \over n-1}\sum _{k=1}^{n-1}\left(1-{k \over 365}\right).

Here the left side is the geometric mean (root of a product) and the right side is the arithmetic mean (division of a sum). For n ≤ 365 both sides are positive, so raising both sides to the power n−1 preserves the inequality. Thus our first replacement is

\prod _{k=1}^{n-1}\left(1-{k \over 365}\right)<\left({1 \over n-1}\sum _{k=1}^{n-1}\left(1-{k \over 365}\right)\right)^{n-1}.

That is, we substitute the sum raised to n−1 for the product. The sum splits into (∑ 1) − (∑k)/365; and since the first is a constant (summing to n−1) and the second an arithmetic progression (summing to n(n − 1)/2), we can replace the sum by an exact expression:

\left({1 \over n-1}\sum _{k=1}^{n-1}\left(1-{k \over 365}\right)\right)^{n-1}=\left(1-{n \over 730}\right)^{n-1}.

Our next substitution uses the inequality 1 − x < e^(−x). Thus we have

\left(1-{n \over 730}\right)^{n-1}<\left(e^{-n/730}\right)^{n-1}.

By the usual laws of exponents, (e^a)^b = e^(ab); so we can simplify again:

\left(e^{-n/730}\right)^{n-1}=e^{-(n^{2}-n)/730}.

Noting that e^(−x) = 1/e^x, we see that e^(−x) < 1/2 is the same as e^x > 2 (reciprocating reverses the inequality). We can take logarithms of both sides without changing the ordering; thus our original inequality demand, that the probability product be less than 1/2, finally simplifies to the much more manageable

n^{2}-n>730\log 2.

(Here "log" refers to the natural logarithm.) Now 730 log 2 is approximately 505.997, which is barely below 506, the value of n² − n attained when n = 23. Therefore, 23 people suffice.

Note that Halmos' derivation only shows that at most 23 people are needed to ensure a birthday match with even chance; since we haven't studied how sharp the given inequalities are, the argument leaves open the possibility that, say, n = 22 could also work.
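The chain of inequalities can be checked numerically. The sketch below evaluates the product and the two bounds that replace it, and finds the smallest n satisfying the final condition. All function names are illustrative.

```python
import math

def product(n: int, d: int = 365) -> float:
    """Exact probability that no two of n birthdays coincide."""
    result = 1.0
    for k in range(1, n):
        result *= 1.0 - k / d
    return result

def am_gm_bound(n: int, d: int = 365) -> float:
    """Bound from the arithmetic-geometric mean step: (1 - n/(2d))^(n-1)."""
    return (1.0 - n / (2 * d)) ** (n - 1)

def exp_bound(n: int, d: int = 365) -> float:
    """Bound from 1 - x < e^(-x): exp(-(n^2 - n)/(2d))."""
    return math.exp(-(n * n - n) / (2 * d))

def smallest_n_by_bound(d: int = 365) -> int:
    """Smallest n with n^2 - n > 2 d log 2."""
    n = 1
    while n * n - n <= 2 * d * math.log(2):
        n += 1
    return n

print(smallest_n_by_bound())  # 23
print(product(22), am_gm_bound(22), exp_bound(22))
```

For n = 22 the final bound is still above 1/2, so the inequalities alone do not settle that case; the exact value p(22) ≈ 0.476 listed in the table of the Reverse problem section shows that 22 people are in fact not enough.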

Empirical test

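days = 365
num_people = 1
prob = 0.0  # probability that at least two of num_people share a birthday

while prob < 0.5:
    num_people += 1
    # The newcomer must avoid the num_people - 1 birthdays already taken.
    prob = 1 - (1 - prob) * (days - (num_people - 1)) / days
    print("Number of people:", num_people)
    print("Prob. of same birthday:", prob)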
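The loop above evaluates the exact product step by step. A test that is empirical in the literal sense can be sketched as a Monte Carlo simulation, assuming uniformly distributed birthdays. The function names, the number of trials and the seed are arbitrary choices for illustration.

```python
import random

def has_shared_birthday(n: int, d: int, rng: random.Random) -> bool:
    """Draw n random birthdays and report whether any two coincide."""
    seen = set()
    for _ in range(n):
        day = rng.randrange(d)
        if day in seen:
            return True
        seen.add(day)
    return False

def estimate(n: int, d: int = 365, trials: int = 20000, seed: int = 0) -> float:
    """Fraction of simulated rooms of n people containing a shared birthday."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    hits = sum(has_shared_birthday(n, d, rng) for _ in range(trials))
    return hits / trials

print(estimate(23))  # should land close to 0.507
```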

Applications

The birthday paradox in its more generic sense applies to hash functions: the number of N-bit hashes that can be generated before a collision becomes likely is not on the order of 2^N (roughly the number needed before one specific hash value is likely to be repeated), but only on the order of 2^(N/2) (the number needed before any two generated hash values are likely to coincide). This is exploited by birthday attacks on cryptographic hash functions.
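As a rough illustration, the approximation n(p;d) from the Reverse problem section can be applied with d = 2^N, under the assumption that hash values behave like uniformly random N-bit numbers. The function name is hypothetical.

```python
import math

def hashes_for_collision(bits: int, prob: float = 0.5) -> float:
    """Approximate number of random hashes of the given bit length needed
    for a collision to occur with probability prob."""
    d = 2.0 ** bits  # number of possible hash values
    return math.sqrt(2 * d * math.log(1 / (1 - prob)))

for bits in (32, 64, 128):
    n = hashes_for_collision(bits)
    # Express the result as a multiple of 2^(bits/2).
    print(bits, n / 2 ** (bits / 2))
```

For a 50% collision probability the multiple is about 1.18 regardless of N, which is the same constant that appears in the p = 0.5 row of the table in the Reverse problem section.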

The theory behind the birthday problem was used in [Schnabel 1938] under the name of capture-recapture statistics to estimate the size of fish populations in lakes.

Unequal probabilities

As mentioned above, real-world birthday data are not equally distributed. The birthday problem for such non-constant birthday probabilities was tackled in [Klamkin 1967].

Near matches

Another generalization is to ask how many people are needed in order to have a better than 50% chance that two people have a birthday within one day of each other, or within two, three, etc., days of each other. This is a more difficult problem and requires use of the inclusion-exclusion principle. The results (assuming an equal distribution for birthdays) are just as surprising as in the standard birthday problem:

within k days	#people required
0	23
1	14
2	11
3	9
4	8
5	8
7	7

Thus in a group of seven people, it is more likely than not that two of them will have a birthday within a week of each other.
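The table can be explored numerically. The sketch below relies on a closed form that is not stated in the text and is an assumption here: if the year is treated as circular (31 December adjacent to 1 January), the probability that no two of n uniformly distributed birthdays fall within k days of each other is (d − nk − 1)! / (d^(n−1) (d − n(k+1))!). The function names are illustrative.

```python
def p_no_near_match(n: int, k: int, d: int = 365) -> float:
    """Probability that no two of n birthdays lie within k days of each
    other, assuming a circular year and the closed form quoted above."""
    if d - n * (k + 1) < 0:
        return 0.0  # too many people to keep everyone k days apart
    top = d - n * k - 1
    prob = 1.0
    # (d - nk - 1)! / (d - n(k+1))! is a product of n - 1 consecutive integers.
    for i in range(n - 1):
        prob *= (top - i) / d
    return prob

def people_needed(k: int, d: int = 365, target: float = 0.5) -> int:
    """Smallest n for which a near match is more likely than the target."""
    n = 2
    while 1.0 - p_no_near_match(n, k, d) <= target:
        n += 1
    return n

for k in range(8):
    print(k, people_needed(k))
```

With k = 0 the closed form reduces to the product used for the standard birthday problem.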

References

Zoe Emily Schnabel: "The estimation of the total fish population of a lake", American Mathematical Monthly 45 (1938), pages 348-352.
M. Klamkin and D. Newman: "Extensions of the birthday surprise", Journal of Combinatorial Theory 3 (1967), pages 279-282.

Note

In his autobiography, Halmos deplored the fact that the birthday paradox is often presented in terms of numerical computation rather than more abstract concepts. He wrote: "The reasoning is based on important tools that all students of mathematics should have ready access to. The birthday problem used to be a splendid illustration of the advantages of pure thought over mechanical manipulation; the inequalities can be obtained in a minute or two, whereas the multiplications would take much longer, and be much more subject to error, whether the instrument is a pencil or an old-fashioned desk computer. What calculators do not yield is understanding, or mathematical facility, or a solid basis for more advanced, generalized theories."
